Sinmin - Corpus for Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva
Outline
● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages
What is a Corpus?
“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)
Usages of a Corpus
● Implementing translators, spell checkers and grammar checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how a language varies with context of use and over time.
● Retrieving statistical details of a language.
● Providing backend support for tools such as OCR and POS taggers.
Sinmin is a corpus for the Sinhala language which is
➢ Continuously updated
➢ Dynamic (scalable)
➢ Broad in coverage (structured and unstructured language)
[Figure: Architecture of Sinmin]
Identified Sinhala Resources
● News: News Paper, News Items
● Academic: Text books, Religious, Wikipedia, Mahawansa
● Creative Writing: Fiction, Blogs, Magazine
● Spoken: Subtitle
● Gazette: Gazette
Crawler Implementation and Design
Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing and storing them in a manageable format.
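One way a crawler could decide whether a fetched page "contains Sinhala content" is to measure how much of its text falls in the Unicode Sinhala block (U+0D80 to U+0DFF). The sketch below is only an illustration: the function names and the 0.5 threshold are assumptions, not taken from the Sinmin crawlers.

```python
SINHALA_START, SINHALA_END = "\u0d80", "\u0dff"  # Unicode Sinhala block


def sinhala_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Sinhala block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for c in chars if SINHALA_START <= c <= SINHALA_END)
    return sinhala / len(chars)


def has_sinhala_content(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical page filter: keep text that is mostly Sinhala."""
    return sinhala_ratio(text) >= threshold
```

A crawler could call has_sinhala_content on the extracted body text before parsing and storing the page.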
[Figure: Crawler Architecture]
[Figure: Sample XML File With One Article Stored In It]
Crawler Controller
The crawler controller monitors and manages the status of the web crawlers.
Data Cleaning and Tokenizing Mechanisms
Identified Issues
● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel signs
Erroneous Characters in the Texts
● Invalid Unicode characters
E.g.: characters in the Private Use Area, the replacement character (U+FFFD)
● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”
● Unwanted non-Sinhala characters
E.g.: U+200C, Á, À, ®, ¡, ª, º
● Non-symbol characters that were terminating words
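The cleaning step described above can be sketched as a character filter. This is a minimal illustration, assuming the unwanted set is exactly the examples named on the slides; the real Sinmin list is not given here, and the function names are hypothetical.

```python
import unicodedata

# Characters named on the slides as unwanted, plus the replacement character.
# Illustrative only; not the project's actual list.
UNWANTED = {"\u200c", "\ufffd", "Á", "À", "®", "¡", "ª", "º"}


def clean_text(text: str) -> str:
    """Drop unwanted characters and anything in a Unicode private use area."""
    return "".join(
        ch for ch in text
        if ch not in UNWANTED and unicodedata.category(ch) != "Co"
    )


def strip_symbols(token: str) -> str:
    """Remove punctuation and symbol characters (categories P* and S*)."""
    return "".join(
        ch for ch in token
        if unicodedata.category(ch)[0] not in ("P", "S")
    )
```

Sinhala vowel signs have mark categories (Mn/Mc), so strip_symbols leaves them attached to their letters. Full stops should only be stripped after sentence splitting, since the short-form handling depends on them.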
Short Forms
● Short forms contain full stops.
● But those full stops separate neither sentences nor words.
E.g.: පෙ. ව. (a.m.), රු. (Rupees)
Identified Common Short Forms
"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."
"පෙ.", "ව.", "ෙ.", "රු."
"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
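A sentence splitter can use such a list to avoid breaking at the full stop of a short form. The sketch below assumes whitespace tokenization and exact token matching, which is a simplification; it uses only the clearly legible entries from the list above, and the function name is hypothetical.

```python
# Short forms copied from the slide, plus a single digit followed by a full stop.
SHORT_FORMS = {
    "ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.",
    "පෙ.", "ව.", "රු.",
} | {f"{d}." for d in range(10)}


def split_sentences(text: str) -> list:
    """Split on tokens ending in a full stop, unless the token is a short form."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in SHORT_FORMS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

With this rule, a token such as රු. keeps the sentence open, while any other token ending in a full stop closes it.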
Consecutive Sinhala Vowel Sign Problem
● Solution: map the variant sequences to a single form
● Convention: only one vowel sign per Sinhala letter
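The mapping can be sketched as a replacement table from two-sign sequences to the single precomposed sign. The pairs below follow the canonical decompositions defined for the Unicode Sinhala block; whether Sinmin used exactly this table is an assumption, and the names are hypothetical.

```python
# (sequence of two signs, single equivalent sign). Order matters: the
# kombuva + aela-pilla pair must be merged before al-lakuna is applied to it.
VOWEL_SIGN_MAP = [
    ("\u0dd9\u0dcf", "\u0ddc"),  # kombuva + aela-pilla -> kombuva haa aela-pilla
    ("\u0ddc\u0dca", "\u0ddd"),  # kombuva haa aela-pilla + al-lakuna -> diga form
    ("\u0dd9\u0dca", "\u0dda"),  # kombuva + al-lakuna -> diga kombuva
    ("\u0dd9\u0ddf", "\u0dde"),  # kombuva + gayanukitta -> kombuva haa gayanukitta
]


def normalize_vowel_signs(text: str) -> str:
    """Rewrite consecutive vowel signs so each letter carries one sign."""
    for sequence, single in VOWEL_SIGN_MAP:
        text = text.replace(sequence, single)
    return text
```

After normalization, two spellings that look identical on screen also compare equal as strings, so they are counted as the same word.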
Selecting Data Storage Mechanism for Sinmin
The performance of data insertion and retrieval depends mainly on the data storage mechanism used for the corpus.
We tested the performance of several database systems to determine which one we should use to store the data.
We Considered the Following Data Storage Systems
We measured performance for inserting data and for retrieving data for 12 different information needs.
Data set and source code: https://github.com/madurangasiriwardena/performance-test
[Figure: Data Insertion Time Comparison]
[Figure: Information Retrieval Performance Comparison - Part 1]
[Figure: Information Retrieval Performance Comparison - Part 2]
Cassandra performed better than the others in most of the scenarios, and its insertion time increased linearly.
So we chose it for implementing the corpus.
Data Storage Model of Sinmin
Cassandra
● We used Cassandra as the main storage system of Sinmin.
● Apache Cassandra version 2.1.2 was used.
● cqlsh version 5.0.1 was used.
● Most API queries are served from the Cassandra database.
● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
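The "one column family per information need" idea can be illustrated without a database: each structure below is keyed for exactly one query, so every read is a single-key lookup and the cost is paid at insertion time. This is an in-memory sketch only; the two structures and all names are hypothetical and are not the Sinmin schema.

```python
from collections import defaultdict

# Hypothetical "column families", each keyed for one query.
word_frequency = defaultdict(int)          # key: word
word_frequency_by_year = defaultdict(int)  # key: (word, year)


def index_article(tokens, year):
    """Write each token to every structure that a query will later read."""
    for token in tokens:
        word_frequency[token] += 1
        word_frequency_by_year[(token, year)] += 1


def frequency(word, year=None):
    """Answer each information need from the structure built for it."""
    if year is None:
        return word_frequency.get(word, 0)
    return word_frequency_by_year.get((word, year), 0)
```

The trade-off is duplicated data on write in exchange for reads that need no joins or scans.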
Oracle
● Oracle is used as a backup storage server.
[Figure: Oracle Schema]
Wildcard Search Feature
The wildcard search feature enables users to run wildcard queries on the corpus.
E.g.: පෙ?, හෙ*
● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10 question marks
Sinhala Vowel Sign Problem in Wildcard Search
In Sinhala Unicode, vowel signs are separate Unicode characters.
Solution: represent a Sinhala letter and its vowel sign as one entity.
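One way to treat a letter and its vowel sign as a single entity is to group every combining mark with the base character before it, then match "?" against whole groups. The sketch below assumes Unicode general categories Mn and Mc identify the signs, handles only "?", and ignores joiner characters; the function names are hypothetical.

```python
import unicodedata


def to_units(text: str) -> list:
    """Group each base character with the combining signs that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch  # attach vowel sign or al-lakuna to its letter
        else:
            units.append(ch)
    return units


def matches_units(pattern: str, word: str) -> bool:
    """'?' matches exactly one unit; every other unit must match literally."""
    p_units, w_units = to_units(pattern), to_units(word)
    if len(p_units) != len(w_units):
        return False
    return all(p == "?" or p == w for p, w in zip(p_units, w_units))
```

With this grouping, the query පෙ? matches both a bare following letter and a following letter that carries its own vowel sign.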
User Interface Design and Implementation
● The web interface of Sinmin is designed for users who prefer a visualised and summarized view of Sinmin's statistical data.
● The visual design lets a user with no prior experience of the interface fulfill their information requirements with little effort.
The Sinmin user interface allows users to:
● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Run wildcard searches
● Find the latest articles for an n-gram
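The first two items can be sketched with plain counts. The example below assumes that "probability" means relative frequency among n-grams of the same length, which the slides do not specify; it covers words and bigrams only, and the class and method names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramStats:
    def __init__(self, tokens):
        self.counts = {1: Counter(ngrams(tokens, 1)),
                       2: Counter(ngrams(tokens, 2))}

    def probability(self, gram):
        """Relative frequency of a unigram or bigram (assumed definition)."""
        counts = self.counts[len(gram)]
        total = sum(counts.values())
        return counts[gram] / total if total else 0.0

    def most_probable_next(self, word):
        """Word that most often follows the given word, or None if unseen."""
        followers = Counter(
            {b[1]: c for b, c in self.counts[2].items() if b[0] == word}
        )
        return followers.most_common(1)[0][0] if followers else None
```

For the token sequence a b a c a b, the bigram (a, b) occurs in 2 of 5 bigrams, so its probability under this definition is 0.4.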
API Design and Implementation
REST API
● REST API to expose corpus services
● Supports more complex and customizable data retrieval and filtering
● Interface for third-party applications to consume
● Depends on backend databases (Cassandra, Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions
[Figure: Architecture of the API]
API Functions
● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount
Performance Testing of the API
[Figure: Throughput Under Different Load Conditions]
[Figure: Time Taken To Process Requests Under Different Load Conditions]
Full Stop Predictor For OCR
● One challenge in OCR development is identifying full stops.
● This tool is a consumer application of Sinmin that predicts the positions of full stops in Sinhala texts.
Publications
● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented)
● Comparison between performance of various database systems for implementing a language corpus - 11th International Beyond Databases, Architectures and Structures conference (Accepted)
Future Work
● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of the corpus.
Q & A
Thank You!
