5a use of annotated corpus

USE OF ANNOTATED CORPUS
Thennarasu Sakkan

Annotated text corpora are an important resource
for advances in NLP research and for developing
different language technologies.
The annotation of corpora is done using a set of
tags, which mark the linguistic properties of a word,
sentence or discourse.
Corpora annotated with various kinds of linguistic
information are a valuable resource for language
technologies, but building them takes a large
amount of effort and time.

Therefore, it is important to create corpora that,
once created, can be used for various purposes.
Layered approach
It was proposed to follow a layered approach. Some of the
layers are:
Layer 1: Morphology
Layer 2: POS (morphosyntactic)
Layer 3: LWG (local word groups)
Layer 4: Chunks
Layer 5: Syntactic Analysis
Layer 6: Thematic roles/Predicate Argument structure
Layer 7: Semantic properties of the lexical items
Layers 8, 9, 10, 11: Word sense, Pronoun referents (Anaphora),
etc.

Example,
((My younger sister Suguna))_NP
((will be coming))_VP ((from Tamil
Nadu))_PP ((early this month))_NP.
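
The bracketed chunk notation above is easy to read back into a program. The following is a minimal sketch, assuming every chunk is written as ((words))_LABEL on one line; the function names parse_chunks and to_iob are hypothetical helpers, not part of any annotation tool mentioned in these slides.

```python
import re

# Assumed format: ((word word ...))_LABEL, with an upper-case chunk label.
CHUNK_RE = re.compile(r"\(\((.+?)\)\)_([A-Z]+)")

def parse_chunks(text):
    """Return a list of (words, label) pairs, one per bracketed chunk."""
    return [(m.group(1).split(), m.group(2)) for m in CHUNK_RE.finditer(text)]

def to_iob(chunks):
    """Flatten chunks into (word, IOB tag) pairs, e.g. B-NP, I-NP."""
    tagged = []
    for words, label in chunks:
        for i, word in enumerate(words):
            prefix = "B" if i == 0 else "I"
            tagged.append((word, f"{prefix}-{label}"))
    return tagged

sentence = ("((My younger sister Suguna))_NP ((will be coming))_VP "
            "((from Tamil Nadu))_PP ((early this month))_NP.")
chunks = parse_chunks(sentence)
for words, label in chunks:
    print(label, " ".join(words))
```

The IOB form is one common way to hand chunk annotation to a sequence labeller; whether it suits a given corpus depends on its tagset.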

((செவ்வாயில்_NNP))_NP ((வெற்றிகரமாக_RB))_RBP
((ரோவர்_NNP விண்கலம்_NN))_NP ((தரையிறங்கியது_VF))_VP
!
((நாசா_NNP விஞ்ஞானிகள்_NN))_NP ((சாதனை_NN))_NP
!!_RD_SYM (See here exclamation marker.)
((நியூயார்க்_NNP))_NP :_RD_PUNC ((செவ்வாய்_NNP
கிரகத்தை_NN ஆய்வு_NN))_NP ((செய்வதற்காக_RB))_RBP
((அமெரிக்கா_NNP))_NP ((அனுப்பிய_VNF))_VGNF ((ரோவர்_NNP
விண்கலம்_NN))_NP ((கிட்டத்தட்ட_RB))_RBP ((8_TC? மாத_NN
பயணத்திற்கு_NN))_NP ((பிறகு_NST))_? ((இன்று_NST))_?
(06.08.12) ((வெற்றிகரமாக_RB))_RBP
((தரையிறங்கியது_VF))_VP ((._PUNC))_?
((விண்வெளி_NN ஆய்வு_NN மையத்தில்_NN))_NP
((இது_PRP))_?? ((ஒரு_TC மிகப்_INTF பெரிய_JJ
மைல்கல்லாக_RB??))_NP?? / RBP?? ((கருதப்படுகிறது_VF))_VP
((._PUNC))_??

((பூமியில்_NN))_NP ((இருந்து_N_NST))_NP?/N_ST?
((சுமார்_RB)) ((570_TC மில்லியன்_NN கி.மீ.,_NN
தொலைவில்_NN))_NP ((உள்ளது_VF))_VGF
((செவ்வாய்_NNP கிரகம்_NNP))_NP ._PUNC
((இந்த_DMD கிரகத்தில்_NN உயிரினங்கள்_NN))_NP
((வாழ்வதற்கான_VNF))_VGNF ((ஏற்ற_JJ சூழல்_NN))_NP
((இருக்கிறதா_VF))_VGF ((என்பது_CCS))_??
((குறித்து_PSP))_?? ((ஆய்வு_NN))_NP
((செய்ய_VINF))_VGINF ((அமெரிக்காவின்_NNP நாசா_NNP
விண்வெளி_NNP ஆராய்ச்சி_NNP மையம்_NNP))_NP
((பல்வேறு_JJ))_JJP ((ஆய்வுகளை_NN))_NP
((மேற்கொண்டு_VNF))_VGNF ((வருகிறது_VF))_VGF.
((செவ்வாய்_NNP கிரகம்_NN))_NP ((தொடர்பான_JJ))_JJP
((படங்களையும்_NN))_NP ((அவ்வப்போது_RB))_RBP
((வெளியிட்டு_VNF வருகிறது_VM))_VGF ._SYM

How are corpora annotated?
• Automatic annotation
• Computer-assisted annotation
• Manual annotation
Sinclair (1992): the introduction of the human
element in corpus annotation reduces
consistency.

Corpus in NLP
NLP is unthinkable without involving corpora.
Corpora are essential ingredients of every aspect
of natural language processing.

a) Morph analysis – the morph features of a given
word are marked. If the word has multiple
morph feature sets, all are provided for it.
• Morphological level
–Prefixes
–Suffixes
–Stems - (morphological annotation)
Example: pens <root="pen" cat="n" gender="m"
number="pl" person="3">|<root="pen"
cat="v" gender="m" number="sing"
person="3" tense="present" aspect="hab">
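
A minimal sketch of how such a morph annotation could be read into feature sets, one per analysis. It assumes alternatives are separated by "|" and written as <attr="value" ...>; the function name parse_morph is hypothetical, and curly quotes are normalised to straight ones before parsing.

```python
import re

ANALYSIS_RE = re.compile(r"<([^<>]*)>")        # one <...> group per analysis
FEATURE_RE = re.compile(r'(\w+)="([^"]*)"')    # attr="value" pairs inside it

def parse_morph(annotation):
    """Return one dict of morph features for each alternative analysis."""
    # Slides often carry typographic quotes; map them to plain ones first.
    plain = annotation.replace("\u201c", '"').replace("\u201d", '"')
    return [dict(FEATURE_RE.findall(body)) for body in ANALYSIS_RE.findall(plain)]

annotation = ('<root="pen" cat="n" gender="m" number="pl" person="3">|'
              '<root="pen" cat="v" gender="m" number="sing" person="3" '
              'tense="present" aspect="hab">')
analyses = parse_morph(annotation)
for features in analyses:
    print(features["cat"], features)
```

Keeping every analysis, rather than picking one, mirrors the rule above that all feature sets are provided when a word is ambiguous.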

Corpus Vs Morph
   %      Te      Ma      Ta     Hi
  10%     63      54      59      4
  20%    293     335     257     11
  30%    934    1196     728     26
  40%   2433    3439    1803     74
  50%   5707    8810    4091    186
  60%  13280   21663    8992    454
  70%  31941   53718   20191   1092

b) POS – a word is tagged for its POS category in a
given sentence.
Example: I need two <pos="NN">pens
</pos="NN"> to finish this article. He
<pos="VBS"> pens </pos="VBS"> his views
regularly.
c) Word sense – the appropriate sense of a word in a
given context is marked.
Example: I need two <word_sense="pen"> pens
</word_sense="pen"> to finish this article. He
<word_sense="write"> pens
</word_sense="write"> his views regularly.
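
The inline POS and word-sense markup in these examples can be pulled out with one pattern. This is a sketch only, assuming the slide's own convention of <layer="value"> word </layer="value">; extract_annotations is a hypothetical helper name.

```python
import re

# Matches <layer="value"> text </layer="value">; the closing tag must repeat both parts.
SPAN_RE = re.compile(r'<(\w+)="([^"]*)">\s*(.*?)\s*</\1="\2">')

def extract_annotations(text):
    """Return (word, layer, value) triples for every annotated span."""
    plain = text.replace("\u201c", '"').replace("\u201d", '"')
    return [(word, layer, value) for layer, value, word in SPAN_RE.findall(plain)]

pos_text = ('I need two <pos="NN">pens </pos="NN"> to finish this article. '
            'He <pos="VBS"> pens </pos="VBS"> his views regularly.')
sense_text = ('I need two <word_sense="pen"> pens </word_sense="pen"> to finish '
              'this article. He <word_sense="write"> pens </word_sense="write"> '
              'his views regularly.')

print(extract_annotations(pos_text))
print(extract_annotations(sense_text))
```

The same surface word "pens" comes back with two different tags and two different senses, which is the disambiguation the annotation is meant to record.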

POS Vs Corpus
About 11% of the word types in the Brown corpus are ambiguous between POS tags.
What about our languages?
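
The question for our own languages can be answered with a POS-tagged corpus. A minimal sketch, assuming the corpus is available as (word, tag) pairs; the function name ambiguity_rate and the toy data are invented for illustration, and the result says nothing about any real corpus.

```python
from collections import defaultdict

def ambiguity_rate(tagged_tokens):
    """Fraction of word types that occur with more than one POS tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    if not tags_by_word:
        return 0.0
    ambiguous = sum(1 for tags in tags_by_word.values() if len(tags) > 1)
    return ambiguous / len(tags_by_word)

# Toy sample built from the "pens" example; tags follow the slide's usage.
toy_corpus = [
    ("I", "PRP"), ("need", "VB"), ("two", "CD"), ("pens", "NN"),
    ("He", "PRP"), ("pens", "VBS"), ("his", "PRP$"), ("views", "NN"),
]
rate = ambiguity_rate(toy_corpus)
print(f"{rate:.1%} of word types are ambiguous")
```

Counting by word type, as here, gives a much lower figure than counting running tokens, because ambiguous words tend to be frequent ones.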

At the sentence level the information could be
a) Identification of chunks/MWEs/LWGs/phrases
Chunks are minimal constituent units.
The chunk analysis of a sentence provides a
shallow level of parsing. Thus, a corpus
annotated with POS and chunks can be useful for
building a shallow parser.
Example: I saw a man with a telescope.
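
To make the idea of shallow parsing concrete, here is a rough sketch of a rule-based chunker over POS tags. The tag names, the rules and the function shallow_chunk are assumptions made for this illustration; a real shallow parser would normally be trained on the annotated corpus rather than hand-written.

```python
NOMINAL = {"DT", "JJ", "NN", "NNP", "PRP"}   # assumed tags that can sit inside an NP

def shallow_chunk(tagged):
    """Group (word, tag) pairs into flat chunks: a list of (label, words)."""
    chunks = []
    for word, tag in tagged:
        if tag in NOMINAL:
            label = "NP"
        elif tag.startswith("V"):
            label = "VP"
        elif tag == "IN":
            label = "PP"
        else:
            label = "O"
        last = chunks[-1] if chunks else None
        if last and label == "NP" and last[0] in ("NP", "PP"):
            last[1].append(word)      # nominal material extends an open NP or PP
        elif last and label == "VP" and last[0] == "VP":
            last[1].append(word)      # verbs extend an open verb group
        else:
            chunks.append((label, [word]))
    return chunks

tagged = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("man", "NN"),
          ("with", "IN"), ("a", "DT"), ("telescope", "NN")]
result = shallow_chunk(tagged)
for label, words in result:
    print(f"(({' '.join(words)}))_{label}")
```

The output is flat: the chunker does not decide whether "with a telescope" attaches to "saw" or to "a man". That decision belongs to the syntactic layer, which is why chunking is called shallow.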

• Syntactic level
– parsing
– treebanking
– bracketing
• Discourse level
– Anaphoric relations (coreference annotation)
– Speech acts (pragmatic annotation)
– Stylistic features such as speech and thought
presentation (stylistic annotation).

Corpus Vs Machine translation
Among the uses of parallel and comparable corpora,
which include lexicography, terminology extraction to
build terminology databases and bilingual reference
tools, pride of place must be given to machine
translation (MT).
Parallel corpora have played a pivotal role in a
(partial) paradigm shift from rule-based approaches
to statistical and example-based approaches to MT.

Essentially, statistical MT (SMT) involves computing
the probability that a TL string is the translation of
an SL string, based on the frequency of the co-
occurrence of these strings in the corpus, whereas
example-based MT (EBMT) involves searching for
similar phrases in previous translations and
extracting the TL fragments corresponding to the SL
fragments.
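
Both ideas can be illustrated on a toy phrase-aligned corpus. The sketch below is a simplification under stated assumptions: the SMT part uses a plain relative-frequency estimate, P(TL | SL) = count(SL, TL) / count(SL), and the EBMT part uses string similarity from Python's difflib as a stand-in for a real matching step. The data, the placeholder TL strings and the function names are all invented for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Toy aligned pairs (SL phrase, TL phrase). The TL strings are placeholders, not real translations.
ALIGNED = [
    ("today", "tl-today-A"),
    ("today", "tl-today-A"),
    ("today", "tl-today-B"),
    ("research", "tl-research"),
    ("early this month", "tl-early-this-month"),
]

def translation_prob(sl, tl, pairs=ALIGNED):
    """SMT-style estimate: how often sl was aligned with tl, out of all uses of sl."""
    pair_counts = Counter(pairs)
    sl_counts = Counter(s for s, _ in pairs)
    if sl_counts[sl] == 0:
        return 0.0
    return pair_counts[(sl, tl)] / sl_counts[sl]

def ebmt_lookup(sl, pairs=ALIGNED, cutoff=0.6):
    """EBMT-style lookup: closest stored SL phrase and its TL fragment, or None."""
    memory = {}
    for s, t in pairs:
        memory.setdefault(s, t)       # keep the first TL fragment seen for each SL phrase
    match = get_close_matches(sl, list(memory), n=1, cutoff=cutoff)
    return (match[0], memory[match[0]]) if match else None

print(translation_prob("today", "tl-today-A"))
print(ebmt_lookup("early this week"))
```

In a real system the counts would come from word or phrase alignment over a large parallel corpus, and the EBMT step would recombine several retrieved fragments instead of returning one.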
