Can We Quantify Domainhood?
Exploring Measures to Assess Domain-Specificity in Web Corpora
Marina Santini
marina.santini@ri.se
RISE Research Institutes of Sweden
RISE Research Institutes of Sweden, Division ICT - RISE SICS East, Sweden TIR 2018, Regensburg, Germany, 4 Sept. 2018
Acknowledgement and Citation
Acknowledgement
This research was supported by E-care@home, a “SIDUS - Strong Distributed
Research Environment” project, funded by the Swedish Knowledge Foundation.
Project website: http://ecareathome.se/
Cite the paper as:
Santini M., Strandqvist W., Nyström M., Alirezai M., Jönsson A. (2018) Can We
Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web
Corpora. In: Elloumi M. et al. (eds) Database and Expert Systems Applications. DEXA
2018. Communications in Computer and Information Science, vol 903. Springer, Cham
DOI: https://doi.org/10.1007/978-3-319-99133-7_17
Presented at: TIR 2018: 15th International Workshop on Technologies for Information
Retrieval. In conjunction with DEXA 2018. Regensburg, Germany, 4 Sept. 2018.
Outline
1. Research questions: evaluating specialized web corpora in
terms of “domainhood”
2. Case study: a web corpus for eCare
3. Methodology: how to measure domainhood
4. Conclusion & Future work
1. Evaluating specialized
web corpora in terms of
“domainhood”
Introduction and Research Questions
Web Corpora
• Web corpora are important
• The evaluation of web corpora is important
• The evaluation of general-purpose web corpora is advanced
• The evaluation of specialized web corpora is less advanced
Quantitative Corpus Evaluation
“When will a grammar based on one corpus be valid for another?
How much will it cost to port a Natural Language Processing (NLP)
application from one domain, with one corpus, to another, with
another?”
Adam Kilgarriff (2001) Comparing corpora. Int. J. Corpus Linguist. 6(1), 97–133
Definition: domainhood
• Domainhood is the degree of domain representativeness or
domain specificity of a web corpus.
• Ex: a high frequency of medical terms is a sign that the corpus is a
specialized medical corpus
• The importance of domain granularity
• Coarse domains vs fine-grained domains
• Lippincott et al. (2011) “while variation at a coarser domain level such as
between newswire and biomedical text is well-studied and known to affect the
portability of NLP systems, there is a need to develop an awareness of
subdomain variation when considering the practical use of language processing
applications […]”.
Research Questions:
Quantifying domainhood
• “Is it possible to automatically quantify the domainhood of a web
corpus regardless of its domain granularity? If so, how?”
2. Case Study: a Web
Corpus for eCare
eCare_sv_01
• eCare_sv_01*
• 155 SNOMED CT terms (chronic diseases)
* Santini M., Jönsson A., Nyström M. and Alirezai M. (2017) "A Web Corpus for eCare: Collection, Lay
Annotation and Learning. First Results". Proceedings of LTA'17, FedCSIS 2017, Prague.
3. Methodology: How to
Measure Domainhood
Which measures?
SUC & eCare_sv_01
Stockholm-Umeå Corpus (SUC) -> reference corpus (1 million words)
eCare_sv_01: domain-specific corpus (approx. 700 000 words)
Metrics
1. Mann-Whitney-Wilcoxon Test
2. Kendall correlation coefficient (τ)
3. Kullback–Leibler (KL) divergence
4. Log-likelihood
5. Burstiness
Gold Standard
Tokenized gold standard (165 unigrams)
Seeds (example):
atrofisk faryngit
atrofisk gastrit
Gold Standard (example)
atrofisk
faryngit
gastrit
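The step from seed terms to gold-standard unigrams can be sketched as follows. This is a minimal Python illustration, not the procedure used in the study: whitespace tokenization, lowercasing and de-duplication are assumptions, and the function name `tokenize_gold_standard` is hypothetical.

```python
def tokenize_gold_standard(seed_terms):
    """Split multi-word seed terms into a sorted list of unique lowercase unigrams.

    Assumption: tokens are separated by whitespace and case is ignored.
    """
    unigrams = set()
    for term in seed_terms:
        unigrams.update(term.lower().split())
    return sorted(unigrams)


# The two example seeds from the slide share the token "atrofisk".
seeds = ["atrofisk faryngit", "atrofisk gastrit"]
print(tokenize_gold_standard(seeds))  # ['atrofisk', 'faryngit', 'gastrit']
```

Because shared tokens are counted once, the number of gold-standard unigrams need not match the number of seed terms.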
Word Frequency Lists
A word frequency list is a “compact representation of a corpus,
lacking much of the information in the corpus but small and
easily tractable.”
Adam Kilgarriff (2010). Comparable corpora within and across languages, word frequency lists and
the KELLY project. In: Proceedings of the 3rd Workshop on Building and Using Comparable Corpora.
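A word frequency list of the kind described in the quote can be built in a few lines. The sketch below is a Python illustration under simple assumptions (pre-tokenized input, ties broken alphabetically); the toy sentence and the function name `word_frequency_list` are made up for the example and do not come from SUC or eCare_sv_01.

```python
from collections import Counter


def word_frequency_list(tokens):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy input, only to show the shape of the output.
tokens = "kronisk gastrit och kronisk faryngit".split()
print(word_frequency_list(tokens))
# [('kronisk', 2), ('faryngit', 1), ('gastrit', 1), ('och', 1)]
```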
Ranked Word Frequencies: Scatter Plot
Mann-Whitney-Wilcoxon Test: Theory
Non-parametric test:
Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to
follow the normal distribution.
If the two distributions differ at the .05 significance level, we can
conclude that SUC and eCare come from different populations.
Mann-Whitney-Wilcoxon Test: Results
• The null hypothesis is that SUC's word frequency list and the
eCare_sv_01 word frequency list come from identical populations.
• To test the hypothesis, we apply the R function wilcox.test() to
compare the corpora.
• The p-value is 0.019, which is below the .05 significance level, so
we reject the null hypothesis.
• Conclusion: at the .05 significance level, we conclude that SUC and eCare
belong to non-identical populations.
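To make the test concrete, here is a minimal Python sketch of a two-sided Mann-Whitney-Wilcoxon test on two lists of frequencies. It is a stand-in for illustration, not the R call used in the study: it assumes a normal approximation with tie and continuity correction, and the function name `mann_whitney_u` is hypothetical. Results may differ slightly from a statistics package that uses an exact test on small samples.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value).

    Assumption: p-value from the normal approximation with tie correction
    and continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided tail probability
    return u_x, p_value


u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u)  # 0.0
```

A p-value below .05 would lead to rejecting the null hypothesis of identical populations, as on the slide.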
Kendall correlation coefficient: Theory
Kendall correlation coefficient (tau) is a non-parametric measure of
correlation between two rankings.
tau is a coefficient between -1 and 1 that indicates how closely two
rankings agree.
(We used the R function cor.test() with method = "kendall" to compute
the test.)
Interpretation:
• -1 = strong negative correlation
• 0 = no association
• 1 = strong positive correlation
Kendall correlation coefficient: Results
Null hypothesis: there is no association between the two rankings (tau = 0)
(We used the R function cor.test() with method = "kendall" and
alternative = "two.sided" to compute the test.)
tau = -0.1093077;
the p-value of the test is 0.000000003122 (p-value in R: 3.122e-09), which
is less than the .05 significance level.
We reject the null hypothesis:
The association is significant, but tau is negative and close to 0, so the
rankings of SUC and eCare's word frequency lists agree very little. We
conclude that the content of eCare is different from SUC.
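How tau is obtained from two rankings can be sketched in plain Python. The version below computes Kendall's tau-b, which adjusts for ties, by comparing all pairs; it is an illustrative stand-in rather than the R call used in the study, it does not compute a p-value, and the name `kendall_tau_b` is hypothetical. The quadratic loop is fine for a demonstration but slow on full frequency lists.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of ranks or scores."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: ignored by tau-b
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else float("nan")


print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

A value near 0, such as the -0.109 reported above, means that concordant and discordant pairs nearly balance out.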
Kullback–Leibler (KL) Divergence: Theory
(a.k.a. relative entropy)
• KL quantifies how “distant” an estimation of a distribution may be
from the true distribution.
• Interpretation: KL divergence is non-negative and equal to zero if the
two distributions are identical.
Kullback–Leibler (KL) Divergence: Results
• (We do not need a null hypothesis)
• (We used the R function KL.empirical() from the package “entropy”, with
log2 as the unit, to compute KL divergence.)
The KL divergence between SUC and eCare_sv_01 is 5.80
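The computation behind such a figure can be sketched in plain Python. This is an illustration under stated assumptions, not the R routine used in the study: both corpora are represented as count vectors aligned on the same vocabulary, logarithms are base 2 (so the result is in bits), and no smoothing is applied, so a word seen only in the first corpus yields infinity. The name `kl_divergence_bits` is hypothetical.

```python
import math


def kl_divergence_bits(p_counts, q_counts):
    """KL(P || Q) in bits from two count vectors aligned on the same vocabulary.

    Assumption: no smoothing; returns infinity if Q has a zero where P does not.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must be aligned and equally long")
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        if p_count == 0:
            continue  # terms with p = 0 contribute nothing
        if q_count == 0:
            return math.inf
        p = p_count / p_total
        q = q_count / q_total
        kl += p * math.log2(p / q)
    return kl


print(kl_divergence_bits([1, 1], [1, 1]))  # 0.0
print(kl_divergence_bits([1, 0], [1, 1]))  # 1.0
```

Note that KL divergence is not symmetric: swapping the two corpora generally gives a different value.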
Up to now...
• ... the gold standard was not involved
• It is confirmed that the two corpora are largely different, but we do not
know whether eCare is representative of the target domain.
Log-Likelihood (LL): Theory
(a.k.a. G²)
A reference corpus is needed.
It is a measure based on a contingency table that compares the observed
frequencies of a word in two corpora with the frequencies expected if the
word were equally common in both.
Interpretation: The larger the LL score of a word, the more different its
distribution in the two corpora.
An LL score of 3.8415 or higher is significant at p < 0.05 and an LL score
of 10.8276 or higher is significant at p < 0.001 (Desagulier, 2017).
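The LL score of a single word can be sketched in Python as the G² statistic over a 2×2 contingency table (word vs. all other words, study corpus vs. reference corpus). This is an assumption about the exact variant: some corpus tools sum over only the two word cells, which gives slightly different values. The function name `log_likelihood_g2` and the counts in the example are hypothetical.

```python
import math


def log_likelihood_g2(freq_a, size_a, freq_b, size_b):
    """G^2 for one word: freq_* is its count, size_* the corpus size in tokens.

    Assumption: full 2x2 table (word / other words x corpus A / corpus B).
    """
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    return 2.0 * g2


# Hypothetical counts: a word seen 100 times in one 1000-token corpus
# and once in another 1000-token corpus.
score = log_likelihood_g2(100, 1000, 1, 1000)
print(score > 10.8276)  # True
```

The score alone does not say in which corpus the word is over-represented; comparing the two relative frequencies resolves that.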
Log-Likelihood (LL): Results
The intersection between the words singled out by LL scores and the gold standard is 58 unigrams, i.e. 35.15% of the 165 gold-standard unigrams.
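The percentage is the size of the intersection divided by the size of the gold standard. A small Python sketch of that check follows; the function name `gold_standard_coverage` is hypothetical, and only the figures 58 and 165 come from the slides.

```python
def gold_standard_coverage(candidate_words, gold_standard):
    """Return (number of gold-standard words found, percentage of the gold standard)."""
    gold = set(gold_standard)
    hits = gold & set(candidate_words)
    return len(hits), 100.0 * len(hits) / len(gold)


# Reproducing the arithmetic reported above: 58 of 165 unigrams.
print(round(100.0 * 58 / 165, 2))  # 35.15
```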
Burstiness: Theory
Burstiness helps identify words that are frequent in certain
documents, but that are unevenly distributed in the corpus as a
whole.
Implementation in R: [code figure not preserved]
“Burstiness is like the mean but it ignores documents with no instances” (Church and Gale, 1995)
Irvine, A., & Callison-Burch, C. (2017). A Comprehensive Analysis of Bilingual Lexicon Induction. Computational Linguistics, 43(2).
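Read literally, the quote describes a mean that is taken only over the documents in which a word occurs. The Python sketch below follows that reading and is not the R implementation from the slide: it assumes burstiness is the sum of a word's relative frequencies per document divided by the number of documents containing it, and the function name `burstiness` and the toy documents are hypothetical.

```python
from collections import Counter


def burstiness(documents):
    """Map each word to its mean relative frequency over the documents containing it.

    documents: list of token lists. Assumption: relative frequency per document
    is count / document length; documents without the word are ignored.
    """
    rel_freq_sum = Counter()
    doc_freq = Counter()
    for tokens in documents:
        if not tokens:
            continue
        counts = Counter(tokens)
        for word, count in counts.items():
            rel_freq_sum[word] += count / len(tokens)
            doc_freq[word] += 1
    return {word: rel_freq_sum[word] / doc_freq[word] for word in doc_freq}


# Toy documents: "gastrit" is concentrated in one document, "och" is spread out.
docs = [
    ["gastrit", "gastrit", "och"],
    ["och", "och", "och"],
    ["och", "en", "ett"],
]
scores = burstiness(docs)
print(scores["gastrit"] > scores["och"])  # True
```

Since no second corpus enters the computation, the score needs no reference corpus, and a ranked list of bursty words can be compared directly with the gold standard.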
Burstiness: Results
Comparison between bursty words
and the chronic diseases’ gold standard
Discussion
• Both statistical tests confirm that the two corpora are weakly correlated.
No gold standard is involved; both rely on a null hypothesis.
• KL divergence returns a large value, indicating that the two corpora are
distant from each other. No gold standard is involved.
• LL scores need a reference corpus. They single out words with different
distributions in the two corpora, and results are compared against a gold
standard, but it is not clear to which corpus the singled-out words belong.
• Burstiness can be computed without a reference corpus. Results can be
measured against a gold standard. It provides promising results.
Profiling Bursty Words: Open Issues
• Less empirical cut-off points are needed.
• Is burstiness affected by the size of the corpus?
• Evaluation metrics (overlap coefficients and precision@) are not so
indicative. Intersection gives a better idea of the quantification.
• What is the best way to test the design of gold standards (= target
domains) for this kind of experiment?
4. Conclusion and
Future Work
What next?
Conclusion
Mann-Whitney-Wilcoxon Test: hypothesis testing on distributions
Kendall correlation coefficient: hypothesis testing on rank correlation
Kullback–Leibler (KL) divergence: requires a reference corpus, cannot be
tested on a gold standard
Log-likelihood: requires a reference corpus, can be tested on a gold standard
Burstiness: does not require a reference corpus and can be tested on a gold
standard
Future Work
• Implementation of additional burstiness formulas
• Inclusion of multi-words in the frequency lists
• Application of burstiness for domainhood on larger corpora
and other languages
• Investigating the ideal design of a gold standard for
domainhood detection
Thanks for your attention!
Any questions?
