See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/280561259
SIGHAN2015 CNER PPT
Data · July 2015
4 authors, including: Xiaodong Zeng (University of Macau), Derek F. Wong (University of Macau), and Aaron L.-F. Han (Dublin City University)
All content following this page was uploaded by Aaron L.-F. Han on 30 July 2015.
Chinese Named Entity Recognition with Graph-based
Semi-supervised Learning Model
Aaron Li-Feng Han* Xiaodong Zeng+ Derek F. Wong+ Lidia S. Chao+
* ILLC, University of Amsterdam, The Netherlands
+ NLP2CT Laboratory, University of Macau, Macau S.A.R., China
SIGHAN 2015 @ Beijing, July 30-31
• NER Tasks
• Traditional methods
• Motivation
• Designed model
• Experiments
• Conclusion
NER Tasks
The annotations in the MUC-7 Named Entity tasks (Marsh and Perzanowski, 1998)
- Entities (organization, person, and location), plus times and quantities such as monetary values and percentages
- Languages: English, Chinese, and Japanese.
The entity categories in CONLL-02 (Tjong Kim Sang, 2002) and
CONLL-03 (Tjong Kim Sang and De Meulder, 2003) NER shared tasks
- Persons, locations, organizations and names of miscellaneous entities
- Languages: Spanish, Dutch, English, and German.
The SIGHAN bakeoff-3 (Levow, 2006) and bakeoff-4 (Jin and Chen,
2008) tasks offer standard Chinese NER (CNER) corpora
- Three commonly used entities, i.e., personal names, location names,
and organization names.
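The three CNER entity types (personal, location, and organization names) are usually marked at the character level with a BIO-style scheme. As a minimal sketch (the sentence and tags here are hypothetical examples, not drawn from the bakeoff data):

```python
# Illustrative BIO tagging for the three CNER entity types (PER, LOC, ORG).
# The example sentence and its tags are hypothetical.
sentence = ["澳", "门", "大", "学", "的", "韩", "利", "峰"]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-PER", "I-PER", "I-PER"]

def extract_entities(chars, tags):
    """Collect (entity_string, type) spans from character-level BIO tags."""
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

print(extract_entities(sentence, tags))
# [('澳门大学', 'ORG'), ('韩利峰', 'PER')]
```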
Traditional methods
Traditional methods for entity recognition tend to employ external annotated corpora to enhance the machine learning stage and improve testing scores with the enhanced models (Zhang et al., 2006; Mao et al., 2008; Yu et al., 2008).
Conditional random field (CRF) models have shown advantages and good performance in CNER tasks compared with other machine learning algorithms (Zhou et al., 2006; Zhao and Kit, 2008).
However, annotated corpora are generally very expensive and time-consuming to produce.
On the other hand, a large amount of freely available unlabeled data on the internet can be used for research.
For this reason, researchers have begun to explore the use of unlabeled data, and semi-supervised learning methods that combine labeled training data with unlabeled external data have shown their advantages (Blum and Chawla, 2001; Shin et al., 2006; Zha et al., 2008; Zhang et al., 2013).
Motivation
Named entity recognition (NER) plays an important role in the NLP literature.
Traditional methods tend to employ a large annotated corpus to achieve high performance.
- However, an annotated corpus is usually expensive to obtain.
Many semi-supervised learning models have been proposed for the NER task to utilize freely available unlabeled data.
Graph-based semi-supervised learning (GBSSL) methods have been employed in many NLP tasks,
e.g.
- sentiment categorization (Goldberg and Zhu, 2006)
- question answering (Celikyilmaz et al., 2009)
- class-instance acquisition (Talukdar and Pereira, 2010)
- structured tagging models (Subramanya et al., 2010)
- joint Chinese word segmentation and part-of-speech (POS) tagging (Zeng et al., 2013)
Why not:
GBSSL for NER
Extend the labeled data
Enhance the learning model, e.g. the conditional random field (CRF)
Designed model
• Enhanced learning with unlabelled data
• GBSSL to extend labeled data
• CRF learning
• Graph-based semi-supervised learning
• - Graph Construction & Label Propagation
• Graph Construction
• - Following the research of Subramanya et al. (2010), the vertices are represented by the character trigrams of the labeled and unlabeled sentences
• - A symmetric k-NN graph is utilized, with the edge weights calculated by a symmetric similarity function designed by Zeng et al. (2013).
• The edge weights are defined as:

  w_{i,j} = sim(x_i, x_j)   if j ∈ k(i) or i ∈ k(j)
  w_{i,j} = 0               otherwise

• where k(i) is the k nearest neighbours of x_i, and sim() is a similarity measure of two vertices, computed from co-occurrence statistics (Zeng et al., 2013).
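The construction above can be sketched in a few lines. This is a simplified stand-in: it uses cosine similarity over toy left/right context-character counts rather than the exact similarity function of Zeng et al. (2013) or the optimized feature set of Han et al. (2013).

```python
import itertools
import math
from collections import Counter, defaultdict

def build_feature_vectors(sentences):
    """Map each character trigram vertex to counts of its context characters.
    A toy feature set; the real system uses an optimized CNER feature set."""
    feats = defaultdict(Counter)
    for sent in sentences:
        for i in range(1, len(sent) - 3):
            tri = sent[i:i + 3]
            feats[tri][("L", sent[i - 1])] += 1   # left context character
            feats[tri][("R", sent[i + 3])] += 1   # right context character
    return feats

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(feats, k):
    """Symmetric k-NN graph: w[i][j] = sim(i, j) if j in kNN(i) or i in kNN(j)."""
    verts = list(feats)
    sims = {(a, b): cosine(feats[a], feats[b])
            for a, b in itertools.combinations(verts, 2)}
    nn = {}
    for v in verts:
        ranked = sorted((u for u in verts if u != v),
                        key=lambda u: sims.get((v, u), sims.get((u, v), 0.0)),
                        reverse=True)
        nn[v] = set(ranked[:k])
    w = defaultdict(dict)
    for (a, b), s in sims.items():
        if b in nn[a] or a in nn[b]:   # symmetric inclusion rule from the formula
            w[a][b] = w[b][a] = s
    return w
```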
The feature set employed to measure the similarity of two vertices based on co-occurrence statistics is the one optimized by Han et al. (2013) for CNER tasks, as shown in Table 1.
• Label propagation
After the graph is constructed on both labeled and unlabeled data:
- Use the label propagation algorithm with a sparsity-inducing penalty (Das and Smith, 2012) to induce trigram-level label distributions from the constructed graph
- Based on the Junto toolkit (Talukdar and Pereira, 2010).
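The core idea of label propagation can be sketched as follows. This is a plain iterative-averaging version, not the sparsity-penalized objective of Das and Smith (2012) used in the actual system: seed vertices keep their empirical label distributions, and each unlabeled vertex repeatedly takes the weighted average of its neighbours' distributions.

```python
# Minimal iterative label propagation over a weighted graph.
# Seeds are clamped; unlabeled vertices average their neighbours.
def propagate(weights, seeds, labels, iters=30):
    """weights: {u: {v: w}}; seeds: {u: {label: prob}} for labeled vertices."""
    dist = {u: dict(seeds.get(u, {l: 1.0 / len(labels) for l in labels}))
            for u in weights}
    for _ in range(iters):
        new = {}
        for u in weights:
            if u in seeds:                 # clamp seed vertices
                new[u] = dict(seeds[u])
                continue
            total = sum(weights[u].values())
            if total == 0:
                new[u] = dist[u]
                continue
            new[u] = {l: sum(w * dist[v][l] for v, w in weights[u].items()) / total
                      for l in labels}
        dist = new
    return dist
```

On a chain a-b-c with only a labeled, the label mass flows outward until c's distribution matches a's.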
Enhance CRF learning
- Put the propagated labeled data into the training data
- Based on the CRF++ toolkit
- Use the same feature set as in Table 1.
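One simple way to turn the propagated trigram-level distributions into extra training material is to tag each character of an unlabeled sentence with the argmax label of the trigram centred on it, then write the two-column lines that CRF++'s `crf_learn` consumes. This is a hypothetical sketch; the thresholding and conflict handling of the actual pipeline are omitted.

```python
# Hypothetical sketch: convert propagated trigram label distributions
# into CRF++-style "character<TAB>label" training lines.
def tag_sentence(sentence, trigram_dist, default="O"):
    lines = []
    for i, ch in enumerate(sentence):
        tri = sentence[max(i - 1, 0):i + 2]   # trigram centred on position i
        dist = trigram_dist.get(tri)
        label = max(dist, key=dist.get) if dist else default
        lines.append(f"{ch}\t{label}")
    return "\n".join(lines)
```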
Experiments
Dataset
- We employ the SIGHAN bakeoff-3 (Levow, 2006) MSRA (Microsoft Research Asia) training and testing data as the standard setting.
- To verify the effectiveness of the GBSSL method for the CRF model in CNER tasks, we utilize some plain (unannotated) text from SIGHAN bakeoff-2 (Emerson, 2005) and bakeoff-4 (Jin and Chen, 2008) as external unlabeled data.
- The data set is introduced in Table 2 in terms of sentence counts.
We set two baseline scores for the evaluation.
- One baseline is the simple left-to-right maximum matching model
(MaxMatch) based on the training data
- Another baseline is the closed CRF model (Closed-CRF) without
using unlabeled data.
- The use of the GBSSL model in semi-supervised CRF learning is denoted as GBSSL-CRF.
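The MaxMatch baseline can be implemented in a few lines: greedily match the longest entity string known from the training data at each position. The toy lexicon below stands in for the entity dictionary extracted from the MSRA training set.

```python
# Left-to-right maximum matching (MaxMatch) baseline.
def max_match(sentence, lexicon):
    """Return (span_text, label) pairs; non-entity characters get 'O'."""
    max_len = max(map(len, lexicon)) if lexicon else 1
    out, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + l]
            if piece in lexicon:           # longest match wins
                out.append((piece, lexicon[piece]))
                i += l
                break
        else:
            out.append((sentence[i], "O"))  # no match: emit single character
            i += 1
    return out

lex = {"北京": "LOC", "北京大学": "ORG"}
print(max_match("我在北京大学", lex))
# [('我', 'O'), ('在', 'O'), ('北京大学', 'ORG')]
```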
• Training costs
- The comparison shows that the number of extracted features grows from 8,729,098 to 11,336,486 (+29.87%) due to the external dataset, and the corresponding iteration count and training hours also grow, by 12.86% and 77.04% respectively.
- The training costs of the CRF learning stage are detailed in Table 3.
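The reported feature growth can be verified directly from the two counts:

```python
# Check the reported relative growth of the extracted feature count.
before, after = 8_729_098, 11_336_486
growth = (after - before) / before * 100
print(f"{growth:.2f}%")  # 29.87%
```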
• Evaluation results
- The evaluation results are shown in Table 4, from the aspects of recall, precision, and the harmonic mean of recall and precision (F1-score).
- The evaluation shows that both the Closed-CRF and GBSSL-CRF models largely outperform the baseline-1 model (MaxMatch).
- Compared with the Closed-CRF model, the GBSSL-CRF model yielded higher precision, lower recall, and finally a slight improvement in F1 score.
- Both GBSSL-CRF and Closed-CRF show higher performance in precision and lower performance in recall.
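Because F1 is the harmonic mean of precision and recall, a model that trades some recall for precision can still come out slightly ahead on F1:

```python
# F1-score: harmonic mean of precision and recall.
def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

print(round(f1(0.90, 0.80), 4))  # 0.8471
```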
To look inside the GBSSL performance on each kind of entity:
- The detailed evaluation results in terms of F1-score are given in Table 5.
- The detailed evaluation on the three kinds of entities shows that both GBSSL-CRF and Closed-CRF perform better on the LOC entity type and worse on the PER and ORG entities.
- Fortunately, the GBSSL model can enhance CRF learning on the two difficult entity types, PER and ORG, with improvements of 0.28% and 0.58% respectively. However, the GBSSL model decreases the F1 score on the LOC entity by 0.19%.
The lower performance of the GBSSL model on the LOC entity may be because the unlabelled data amounts to only 62.75% of the training corpus, which is not large enough to cover the out-of-vocabulary (OOV) test words of the LOC entity; on the other hand, the unlabeled data also brings some noise into the model.
Conclusion and future work
This paper investigates the effectiveness of the GBSSL model for the traditional CNER task.
Extend data:
- The experiments verify that GBSSL can enhance state-of-the-art CRF learning models. The improvement is small because the unlabeled data is not large enough. In future work, we plan to use a larger unlabeled dataset to enhance the CRF learning model.
Feature selection:
The feature set optimized for CRF learning may not be the best one for the similarity calculation in the graph construction stage.
So we will make an effort to select the best feature set for measuring vertex similarity in graph construction on CNER documents.
Corpus type:
In this paper, we utilized the Microsoft Research Asia (MSRA) corpus for the experiments.
We will test on more kinds of Chinese corpora, such as the CITYU and LDC corpora.
Performance on OOV words:
The GBSSL model generally improves the tagging accuracy of out-of-vocabulary (OOV) words in the test data, which are unseen in the training corpora.
In future work, we plan to give a detailed analysis of the GBSSL model's performance on OOV words for CNER tasks.
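The OOV set for such an analysis is simply the test-set entity strings that never occur in the training annotations. As a sketch (the entity lists here are hypothetical placeholders for the MSRA gold annotations):

```python
# Isolate out-of-vocabulary entities: test entities unseen in training.
def oov_entities(train_entities, test_entities):
    seen = set(train_entities)
    return sorted({e for e in test_entities if e not in seen})

print(oov_entities({"北京", "上海"}, {"北京", "深圳"}))  # ['深圳']
```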
• Thanks for your attention!