USE OF ANNOTATED CORPUS
Thennarasu Sakkan
Annotated text corpora are an important resource
for advances in NLP research and for developing
different language technologies.
The annotation of corpora is done using a set of
tags, which mark the linguistic properties of a word,
sentence or discourse.
Corpora annotated with various kinds of linguistic
information are a precious resource for language
technologies, but building them takes a large
amount of effort and time.
Therefore, it is important to create corpora that,
once built, can be reused for various purposes.
Layered approach
It was proposed to follow a layered approach. Some of the
layers are:
Layer 1: Morphology
Layer 2: POS (morphosyntactic)
Layer 3: LWG (local word groups)
Layer 4: Chunks
Layer 5: Syntactic Analysis
Layer 6: Thematic roles/Predicate Argument structure
Layer 7: Semantic properties of the lexical items
Layers 8-11: Word sense, pronoun referents (anaphora), etc.
Example:
((My younger sister Suguna))_NP ((will be coming))_VP
((from Tamil Nadu))_PP ((early this month))_NP.
Tamil example (a news report on the rover spacecraft landing on Mars):
((செவ்வாயில்_NNP))_NP ((வெற்றிகரமாக_RB))_RBP
((ரோவர்_NNP விண்கலம்_NN))_NP ((தரையிறங்கியது_VF))_VP
!
((நாசா_NNP விஞ்ஞானிகள்_NN))_NP ((சாதனை_NN))_NP
!!_RD_SYM (See here exclamation marker.)
((நியூயார்க்_NNP))_NP :_RD_PUNC ((செவ்வாய்_NNP
கிரகத்தை_NN ஆய்வு_NN))_NP ((செய்வதற்காக_RB))_RBP
((அமெரிக்கா_NNP))_NP ((அனுப்பிய_VNF))_VGNF ((ரோவர்_NNP
விண்கலம்_NN))_NP ((கிட்டத்தட்ட_RB))_RBP ((8_TC? மாத_NN
பயணத்திற்கு_NN))_NP ((பிறகு_NST))_? ((இன்று_NST))_?
(06.08.12) ((வெற்றிகரமாக_RB))_RBP
((தரையிறங்கியது_VF))_VP ((._PUNC))_?
((விண்வெளி_NN ஆய்வு_NN மையத்தில்_NN))_NP
((இது_PRP))_?? ((ஒரு_TC மிகப்_INTF பெரிய_JJ
மைல்கல்லாக_RB??))_NP?? / RBP?? ((கருதப்படுகிறது_VF))_VP
((._PUNC))_??
((பூமியில்_NN))_NP ((இருந்து_N_NST))_NP?/N_ST?
((சுமார்_RB)) ((570_TC மில்லியன்_NN கி.மீ.,_NN
தொலைவில்_NN))_NP ((உள்ளது_VF))_VGF
((செவ்வாய்_NNP கிரகம்_NNP))_NP ._PUNC
((இந்த_DMD கிரகத்தில்_NN உயிரினங்கள்_NN))_NP
((வாழ்வதற்கான_VNF))_VGNF ((ஏற்ற_JJ சூழல்_NN))_NP
((இருக்கிறதா_VF))_VGF ((என்பது_CCS))_??
((குறித்து_PSP))_?? ((ஆய்வு_NN))_NP
((செய்ய_VINF))_VGINF ((அமெரிக்காவின்_NNP நாசா_NNP
விண்வெளி_NNP ஆராய்ச்சி_NNP மையம்_NNP))_NP
((பல்வேறு_JJ))_JJP ((ஆய்வுகளை_NN))_NP
((மேற்கொண்டு_VNF))_VGNF ((வருகிறது_VF))_VGF.
((செவ்வாய்_NNP கிரகம்_NN))_NP ((தொடர்பான_JJ))_JJP
((படங்களையும்_NN))_NP ((அவ்வப்போது_RB))_RBP
((வெளியிட்டு_VNF வருகிறது_VM))_VGF ._SYM
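The bracketed notation used in the examples above can be read mechanically. The following is a minimal sketch, assuming each chunk is written as ((token token ...))_LABEL and each token optionally carries a _TAG suffix. The names CHUNK_RE and parse_chunks are illustrative, not part of any annotation tool. Only the leading alphanumeric part of a chunk label is read, so chunks whose label is just "?" are skipped.

```python
import re

# One chunk looks like "((token token ...))_LABEL" (assumed format).
CHUNK_RE = re.compile(r"\(\((.*?)\)\)_(\w+)")

def parse_chunks(text):
    """Return a list of (tokens, chunk_label) pairs.

    Each token is a (word, pos) pair; pos is None when the word
    carries no "_TAG" suffix.
    """
    chunks = []
    for body, label in CHUNK_RE.findall(text):
        tokens = []
        for item in body.split():
            # Split on the first underscore so tags such as N_NST stay whole.
            word, sep, pos = item.partition("_")
            tokens.append((word, pos if sep else None))
        chunks.append((tokens, label))
    return chunks

sentence = ("((My younger sister Suguna))_NP ((will be coming))_VP "
            "((from Tamil Nadu))_PP ((early this month))_NP.")

for tokens, label in parse_chunks(sentence):
    print(label, [word for word, _ in tokens])
```

Splitting on the first underscore rather than the last is a design choice that assumes words themselves never contain an underscore, which holds for the examples on these slides.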
How are corpora annotated?
• Automatic annotation
• Computer-assisted annotation
• Manual annotation
Sinclair (1992): the introduction of the human
element in corpus annotation reduces
consistency.
Corpus in NLP
NLP is unthinkable without involving corpora.
Corpora are an essential ingredient of every aspect
of natural language processing.
a) Morph analysis – the morph features of a given
word are marked. If the word has multiple
morph feature sets, all are provided for it.
• Morphological level
– Prefixes
– Suffixes
– Stems (morphological annotation)
Example: pens <root="pen" cat="n" gender="m"
number="pl" person="3">|<root="pen"
cat="v" gender="m" number="sing"
person="3" tense="present" aspect="hab">
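A word with several morph feature sets can be held as a list of attribute dictionaries, one per analysis. The sketch below assumes the notation of the example above, written with straight quotes and with alternatives separated by "|". ATTR_RE and parse_morph are hypothetical names chosen for illustration.

```python
import re

# attribute="value" pairs inside one analysis (assumed notation).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_morph(analysis):
    """Split the alternatives on "|" and read each one's attribute=value pairs."""
    return [dict(ATTR_RE.findall(alt)) for alt in analysis.split("|")]

pens = ('<root="pen" cat="n" gender="m" number="pl" person="3">|'
        '<root="pen" cat="v" gender="m" number="sing" person="3" '
        'tense="present" aspect="hab">')

analyses = parse_morph(pens)
for features in analyses:
    print(features["cat"], features)
```

Keeping every analysis, rather than picking one, mirrors the rule stated above: disambiguation is left to a later layer such as POS tagging.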
Corpus Vs Morph
%      Te      Ma      Ta      Hi
10%    63      54      59      4
20%    293     335     257     11
30%    934     1196    728     26
40%    2433    3439    1803    74
50%    5707    8810    4091    186
60%    13280   21663   8992    454
70%    31941   53718   20191   1092
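The slide does not say what the counts in this table measure. One plausible reading, and it is only an assumption, is that each cell gives the number of most frequent word forms needed to cover the stated percentage of tokens in a corpus of that language. Under that assumption, such a figure could be computed as in this sketch (types_needed and the toy sentence are illustrative only):

```python
from collections import Counter

def types_needed(tokens, coverage):
    """Smallest number of most frequent word forms that cover
    the given share (0.0 to 1.0) of all tokens."""
    counts = Counter(tokens)
    target = coverage * len(tokens)
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target:
            return rank
    return len(counts)

# Toy corpus, not real data.
toy = "the cat saw the dog and the dog saw the bird".split()

for share in (0.3, 0.5, 0.7):
    print(f"{share:.0%}", types_needed(toy, share))
```

On this reading, a language whose counts grow quickly needs many more distinct word forms to reach the same coverage, which is what one would expect from rich morphology.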
b) POS – a word is tagged for its POS category in a
given sentence.
Example: I need two <pos="NNS"> pens
</pos="NNS"> to finish this article. He
<pos="VBZ"> pens </pos="VBZ"> his views
regularly.
c) Word sense – the appropriate sense of a word in a
given context is marked.
Example: I need two <word_sense="pen"> pens
</word_sense="pen"> to finish this article. He
<word_sense="write"> pens
</word_sense="write"> his views regularly.
POS Vs Corpus
About 11% of the word types in the Brown corpus are
ambiguous between two or more POS tags.
What about our languages?
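A POS-tagged corpus makes this question measurable for any language. The sketch below counts the share of word types that occur with more than one tag. The toy data uses Penn Treebank style tags, which is an assumption, and ambiguity_rate is a hypothetical helper name.

```python
from collections import defaultdict

def ambiguity_rate(tagged_tokens):
    """Share of word types that appear with more than one POS tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    if not tags_by_word:
        return 0.0
    ambiguous = sum(1 for tags in tags_by_word.values() if len(tags) > 1)
    return ambiguous / len(tags_by_word)

# Toy tagged corpus, not real data.
toy_tagged = [
    ("I", "PRP"), ("need", "VBP"), ("two", "CD"), ("pens", "NNS"),
    ("He", "PRP"), ("pens", "VBZ"), ("his", "PRP$"), ("views", "NNS"),
]

rate = ambiguity_rate(toy_tagged)
print(f"{rate:.1%} of word types are ambiguous")
```

Counting over tokens instead of types would give a different, usually higher, figure, because ambiguous words tend to be frequent ones.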
At the sentence level the information could be
a) Identification of chunks/MWEs/LWGs/phrases
Chunks are minimal constituent units.
The chunk analysis of a sentence provides a
shallow level of parsing. Thus, a corpus
annotated with POS and chunks can be useful for
building a shallow parser.
Example: I saw a man with a telescope.
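To make the idea of shallow parsing concrete, here is a deliberately naive sketch of a noun-phrase chunker over POS-tagged input. It is not the method of any particular shallow parser. The tag set NP_TAGS, the function name, and the hand-tagged sentence are all assumptions for illustration.

```python
# Tags assumed to occur inside a noun phrase (Penn Treebank style).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "PRP"}

def chunk_noun_phrases(tagged):
    """Group maximal runs of NP-internal tags into NP chunks.

    Tokens with any other tag become one-word chunks labelled with that tag.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)
            continue
        if current:
            chunks.append(("NP", current))
            current = []
        chunks.append((tag, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks

tagged = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("man", "NN"),
          ("with", "IN"), ("a", "DT"), ("telescope", "NN")]

for label, words in chunk_noun_phrases(tagged):
    print(label, words)
```

The chunker finds the minimal units but does not decide whether "with a telescope" attaches to "saw" or to "a man". Resolving that attachment is left to the syntactic level.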
• Syntactic level
– parsing
– treebanking
– bracketing
• Discourse level
– Anaphoric relations (coreference annotation)
– Speech acts (pragmatic annotation)
– Stylistic features such as speech and thought
presentation (stylistic annotation).
Corpus Vs Machine translation
Among the uses of parallel and comparable corpora, which
include lexicography and terminology extraction to
build terminology databases and bilingual reference
tools, pride of place must be given to machine
translation (MT).
Parallel corpora have played a pivotal role in a
(partial) paradigm shift from rule-based approaches
to statistical and example-based approaches to MT.
Essentially, statistical MT (SMT) involves computing
the probability that a target-language (TL) string is
the translation of a source-language (SL) string, based
on how often these strings co-occur in the corpus, whereas
example-based MT (EBMT) involves searching for
similar phrases in previous translations and
extracting the TL fragments corresponding to the SL
fragments.
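The co-occurrence idea behind SMT can be illustrated with a relative-frequency estimate, P(TL | SL) = count(SL, TL) / count(SL). This is a heavily simplified sketch: it assumes the SL and TL strings are already aligned, whereas real SMT systems have to learn the alignments as well. The function name and the toy pairs (English with romanized Tamil) are illustrative assumptions.

```python
from collections import Counter

def translation_probs(aligned_pairs):
    """Relative-frequency estimate P(tl | sl) = count(sl, tl) / count(sl)."""
    pair_counts = Counter(aligned_pairs)
    sl_counts = Counter(sl for sl, _ in aligned_pairs)
    return {(sl, tl): n / sl_counts[sl]
            for (sl, tl), n in pair_counts.items()}

# Toy aligned (SL, TL) pairs, not real corpus data.
pairs = [
    ("pen", "pena"), ("pen", "pena"), ("pen", "pena"),
    ("pen", "ezhuthu"),
    ("book", "puththagam"),
]

probs = translation_probs(pairs)
for (sl, tl), p in sorted(probs.items()):
    print(f"P({tl} | {sl}) = {p:.2f}")
```

With these toy counts, "pen" is translated as "pena" in three of its four occurrences, so that translation receives probability 0.75.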
Demo: KWIC (keyword in context) concordance
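A KWIC (keyword in context) display lists every occurrence of a search word with a few words of context on either side. The following is a minimal sketch, assuming whitespace-tokenised text. The function name kwic and the width parameter are illustrative choices, not taken from any concordancer.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left, keyword, right) triple per occurrence of keyword,
    with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "I need two pens to finish this article . He pens his views regularly ."
hits = kwic(text.split(), "pens")

for left, kw, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  [{kw}]  {right}")
```

Lining the keyword up in one column makes it easy to see at a glance that "pens" follows a numeral in one line and a pronoun in the other, the kind of cue a POS annotator relies on.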