TURKISH LANGUAGE MODELING
Chaza Alkis, Abdurrahim Derric
Department of Computer Engineering
Yildiz Technical University, 34220 Istanbul, Türkiye
shaza.alqays@hotmail.com, abdelrahimdarrige@gmail.com
Abstract—Our project is about guessing the correct missing
word in a given sentence. To guess the missing word we use
two main approaches: statistical language modeling and
neural language models.
Statistical language modeling depends on the frequency of
relations between words, and here we use a Markov chain.
Neural language models use artificial neural networks trained
with deep learning; here we use BERT, the state of the art
in language modeling, released by Google.
Keywords—Statistical Language Modelling, Neural Language
Models, Markov Chain, Artificial Neural Networks, Deep Learn-
ing, BERT.
I. INTRODUCTION
Our project is a new technique for guessing the appropriate
word in a given sentence. To obtain good results we studied
several models and tested them on the Turkish language,
covering both statistical language modeling and neural
language models.
II. LANGUAGE MODELING
Language modeling is central to many important natural
language processing tasks.
III. STATISTICAL LANGUAGE MODELING
A statistical language model (SLM) is a probability
distribution over sequences of words.
The language model learns the probability of words
occurring based on examples of text. Simpler models may
condition only on a short window of preceding words, while
larger models may work at the level of sentences or paragraphs.
Most commonly, language models work at the word level.
A language model can be developed and used on its own,
for example to generate new sequences of text that appear
to come from the training corpus.
Language modeling is an essential component of a wide
range of natural language processing tasks. More
practically, language models are used at the input or
output side of a more sophisticated model for a task that
requires understanding the language.
Developing better language models often results in models
that perform better on the intended natural language
processing task. This is the motivation for developing better
and more accurate language models [1].
IV. NEURAL LANGUAGE MODELS
Recently, the use of neural networks in the development
of language models has become so popular that it may now
be the preferred approach.
The use of neural networks in language modeling is often
called Neural Language Modeling, or NLM for short.
Neural network approaches achieve better results than
classical methods, both as standalone language models and
when incorporated into larger models for challenging tasks
such as speech recognition and machine translation.
The main reason behind the improvements in performance
may be the ability of the method to generalize.
Specifically, a word embedding is adopted that uses a
real-valued vector to represent each word in a projected
vector space. This representation of words, learned from
how the words are used, allows words with a similar
meaning to have a similar representation.
Such generalization is not easily achieved with the word
representations used in classical statistical language
models.
Furthermore, the distributed representation approach allows
the embedding representation to scale better with
vocabulary size. Classical methods, with one discrete
representation per word, fight the curse of dimensionality:
larger and larger vocabularies lead to longer and sparser
representations.
The neural network approach to language modeling can be
described by the following three model properties:
• Associate each word in the vocabulary with a
distributed word feature vector.
• Express the joint probability function of word
sequences in terms of the feature vectors of these
words in the sequence.
• Learn simultaneously the word feature vector and
the parameters of the probability function.
This represents a relatively simple model where both
representation and probability model are learned together
directly from raw text data.
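To make these three properties concrete, the following is a minimal sketch of a fixed-window feed-forward neural language model in the style described above; it is our own illustration (not the model used in this project), and the layer sizes, window length, and vocabulary size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Minimal fixed-window neural language model (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim=64, window=3, hidden=128):
        super().__init__()
        # Property 1: each word in the vocabulary gets a distributed feature vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Property 2: the next-word distribution is a function of the
        # concatenated feature vectors of the context words.
        self.ff = nn.Sequential(
            nn.Linear(window * embed_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):         # context_ids: (batch, window)
        vecs = self.embed(context_ids)      # (batch, window, embed_dim)
        flat = vecs.flatten(start_dim=1)    # (batch, window * embed_dim)
        return self.ff(flat)                # unnormalized scores over the vocabulary

# Property 3: the embeddings and the parameters of the probability function are
# learned jointly, e.g. with a cross-entropy loss on next-word prediction.
model = FeedForwardLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 3)))   # two contexts of three word ids
probs = torch.softmax(logits, dim=-1)             # probability distribution over words
```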
Recently, neural-network-based approaches have begun to
consistently outperform classical statistical approaches.
V. MODELS STUDY
A. Markov chain
A Markov chain is a stochastic model describing a
sequence of possible events in which the probability of each
event depends only on the state attained in the previous
event.
More formally, a discrete-time Markov chain is a sequence of
random variables X_1, X_2, X_3, ... that satisfies the Markov
property: the probability of moving to the next state depends
only on the current state.
In terms of probability distributions, given the state of the
system at time n, the conditional distribution of the state at
the next time step, n + 1, is conditionally independent of the
states of the system at times 1, 2, ..., n - 1.
This can be written as follows:
$\Pr(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_{n+1} = x \mid X_n = x_n)$
1) Markov chain graph representation: Markov chains are
often represented using directed graph diagrams. The nodes
in the diagram represent the possible states of the random
variables, while the edges represent the probability that the
system moves from one state to another at the next time
step.
For example, in weather forecasting there are three possible
states for the random variable Weather = {Sunny, Rainy,
Snowy}, and a possible Markov chain can be represented as
shown in Figure 1.
Figure 1 Markov chain graph representation
One of the main points to understand about Markov chains
is that they model the outcomes of a series of random
variables over time. The nodes in the graph above represent
the different weather conditions, and the edges between them
give the probability that the next random variable takes each
of the possible states, given the state of the current random
variable. Self-loops give the probability that the model
remains in its current state.
In the Markov chain above, the observed state of the current
random variable is Sunny. The probability that the random
variable at the next time step is again Sunny is 0.8. It may
instead be Rainy with probability 0.19 or Snowy with
probability 0.01.
2) Parameterization of Markov chains: Another way to
represent state transitions is a transition matrix which, as
the name implies, is a tabular representation of the
transition probabilities.
Table 1 shows the transition matrix for the Markov chain in
Figure 1. Each value is the probability of the system moving
from the state in the row to the state in the corresponding
column.
Table 1 Transition matrix
state   sunny   rainy   snowy
sunny   0.8     0.19    0.01
rainy   0.2     0.7     0.1
snowy   0.1     0.2     0.7
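As a minimal illustration (our own sketch, not the project's implementation), the transition matrix of Table 1 can be stored as a nested dictionary and used to rank or sample the next state; for the word-guessing task the states would be words and the probabilities would be estimated from n-gram counts in the corpus.

```python
import random

# Transition matrix from Table 1: P(next_state | current_state).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.19, "snowy": 0.01},
    "rainy": {"sunny": 0.2, "rainy": 0.7,  "snowy": 0.1},
    "snowy": {"sunny": 0.1, "rainy": 0.2,  "snowy": 0.7},
}

def most_likely_next(state):
    """Return the next state with the highest transition probability."""
    return max(transitions[state], key=transitions[state].get)

def sample_next(state):
    """Sample the next state according to the transition probabilities."""
    states, probs = zip(*transitions[state].items())
    return random.choices(states, weights=probs, k=1)[0]

print(most_likely_next("sunny"))   # sunny (probability 0.8)
print(sample_next("rainy"))        # usually rainy
```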
B. BERT
Bidirectional Encoder Representations from Transformers
(BERT) is a technique for NLP (Natural Language
Processing) pre-training developed by Google.
Modern deep-learning-based NLP models benefit from much
larger amounts of data, improving when trained on millions,
or billions, of annotated training examples. To help close
this data gap, researchers have developed a variety of
techniques for training general-purpose language models,
such as BERT, on the enormous amount of unannotated text
available on the web (known as pre-training).
1) Why BERT is different: BERT is the first unsupervised,
deeply bidirectional language representation, pre-trained
using only a plain text corpus.
For example, in the sentence "I accessed the bank account",
a unidirectional contextual model would represent "bank"
based on "I accessed" but not on "account". BERT, however,
represents "bank" using both its previous and next context,
"I accessed ... account", starting from the very bottom of
the deep neural network, making it deeply bidirectional [2].
2) Masked language modeling: BERT is pre-trained on masked
language modeling and next sentence prediction (next
sentence prediction is explained in the next section).
Language modeling is the task of predicting the next word
given a sequence of words. In masked language modeling,
instead of predicting every next token, a percentage of the
input tokens is masked at random and only those masked
tokens are predicted.
The selected words are not always replaced with the mask
token, [MASK], because then the masked tokens would never
be seen before fine-tuning. Therefore, the procedure is as
follows (a minimal sketch of this masking step appears after
the list):
• 15% of the tokens are chosen at random.
• 80% of the time tokens are actually replaced with
the token [MASK].
• 10% of the time tokens are replaced with a random
token.
• 10% of the time tokens are left unchanged.
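The masking step could be sketched as follows; this is our own illustration under the assumption that whole tokens are masked, and the real BERT implementation differs in details such as special-token handling.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style masking sketch: 15% of tokens are selected; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)                     # the model must predict this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)                    # token is not predicted
    return masked, labels

tokens = "the man went to the store".split()
print(mask_tokens(tokens, vocab=list(set(tokens))))
```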
3) Next sentence prediction: In next sentence prediction the
model receives a pair of sentences and predicts whether the
second sentence actually follows the first in the original
text, for example:
Input = [CLS] the man went to [MASK] store [SEP]
he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
This task can easily be generated from any monolingual
corpus. It is useful because many downstream tasks, such
as question answering and natural language inference,
require understanding the relationship between two
sentences.
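A minimal sketch of how such sentence pairs could be generated from a monolingual corpus; the 50/50 split follows the original BERT recipe, while the function and variable names are our own assumptions.

```python
import random

def make_nsp_pair(sentences, i):
    """Build one next-sentence-prediction example from a list of sentences.
    Half the time sentence B is the true next sentence (IsNext); otherwise
    it is a random sentence from the corpus (NotNext)."""
    sent_a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        sent_b, label = sentences[i + 1], "IsNext"
    else:
        # A full implementation would avoid accidentally picking the true next sentence.
        sent_b, label = random.choice(sentences), "NotNext"
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
print(make_nsp_pair(corpus, 0))
```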
4) Input text representation before feeding to BERT: The
input representation used by BERT can represent a single
text sentence as well as a pair of sentences (for example,
[Question, Answer]) in a single sequence of tokens.
• The first token of every input sequence is the
special classification token – [CLS]. This token is
used in classification tasks as an aggregate of the
entire sequence representation. It is ignored in
non-classification tasks.
• For single text sentence tasks, this [CLS] token is
followed by the WordPiece tokens and the separator
token – [SEP],
[CLS] my cat is very good [SEP]
• For sentence pair tasks, the WordPiece tokens of the
two sentences are separated by another [SEP] token.
This input sequence also ends with the [SEP] token,
[CLS] my cat is cute [SEP] he likes play ##ing [SEP]
• A segment embedding indicating sentence A or sentence B
is added to each token. Segment embeddings are similar
to token embeddings, but with a vocabulary of 2.
• A positional embedding is also added to each token
to indicate its position in the sequence.
BERT uses WordPiece tokenization. The vocabulary is
initialized with all the individual characters of the language,
and then the most frequent/likely combinations of symbols
are iteratively added to the vocabulary.
Any word that does not occur in the vocabulary is broken
down into sub-words greedily. For example, if play, ##ing,
and ##ed are present in the vocabulary but playing and played
are OOV words, then they will be broken down into play + ##ing
and play + ##ed respectively (## is used to mark sub-words).
The maximum sequence length of the input is 512
tokens [3].
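For illustration, the HuggingFace transformers library (an assumption on our part; not necessarily the tooling used in this project) exposes this tokenization and input construction directly. The checkpoint name below is a public Turkish BERT used only as a stand-in.

```python
from transformers import BertTokenizer

# Public Turkish checkpoint used as a stand-in; any BERT vocabulary behaves similarly.
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# Out-of-vocabulary words are split greedily into sub-words marked with "##"
# (the exact split depends on the vocabulary).
print(tokenizer.tokenize("playing"))

# A sentence pair becomes [CLS] ... [SEP] ... [SEP] with segment ids 0 (A) and 1 (B).
enc = tokenizer("kedim çok iyi", "oynamayı seviyor")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])
```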
VI. RESULTS ANALYSIS
A. Markov Chain model dataset size effect comparison
Here we compare the effect of the dataset size. We notice
that using a larger dataset gives a slight improvement. This
is an expected result, because some words may not be found
in a smaller dataset, and the larger the dataset, the more
words we find. The best results are obtained with order 1,
see Figure 2.
Figure 2 20K - 40K - 100K datasets comparison
B. Smoothing algorithms comparison
Smoothing here means searching for the result by backing off
from the third-order model to the second-order model and
then to the first-order model. We noticed that it had a good
effect on the results; see Figure 3 and the backoff sketch
after it.
Figure 3 Smoothed - Unsmoothed algorithms comparison
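A minimal sketch of this backoff search (our own illustration; the count tables and helper names are assumptions) that falls back from the trigram context to the bigram and finally to the unigram counts when no match is found:

```python
def predict_backoff(context, trigrams, bigrams, unigrams):
    """Guess the missing word by backing off from order 3 to order 1.
    trigrams and bigrams map context tuples to {word: count};
    unigrams maps words to counts."""
    if len(context) >= 2 and tuple(context[-2:]) in trigrams:
        candidates = trigrams[tuple(context[-2:])]
    elif len(context) >= 1 and (context[-1],) in bigrams:
        candidates = bigrams[(context[-1],)]
    else:
        candidates = unigrams
    # Rank candidates by count; the top one is the guessed word.
    return max(candidates, key=candidates.get)

trigrams = {("went", "to"): {"the": 5, "a": 2}}
bigrams = {("to",): {"the": 9, "be": 4}}
unigrams = {"the": 100, "a": 60}
print(predict_backoff(["man", "went", "to"], trigrams, bigrams, unigrams))  # "the"
```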
C. BERT model results comparison
Here we compare BERT results across the 20K, 40K and
100K datasets. We notice that the largest dataset has the
strongest effect; see Figure 4.
Figure 4 BERT results comparison
D. BERT vs Google Multilingual
Here we compare our BERT model and Google's multilingual
BERT. We notice that the multilingual model gives much
lower results than our BERT model and is unsuccessful in
finding the missing word, because it covers more than 100
languages and cannot focus on a single one. Our BERT model,
in contrast, is trained on Turkish alone, so its ability to
link Turkish words and the meanings between sentences is
stronger, and this is the reason for the big difference in
results; see Figure 5.
Figure 5 BERT vs Multilingual
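As an illustration of this kind of comparison (a sketch only; the checkpoint names below are public models used as stand-ins, not the exact models trained in this work), the same masked Turkish sentence can be sent to a monolingual and a multilingual checkpoint through the fill-mask pipeline:

```python
from transformers import pipeline

sentence = "Adam markete [MASK] ve bir şişe süt aldı."

# Monolingual Turkish checkpoint (stand-in for our Turkish BERT model).
turkish = pipeline("fill-mask", model="dbmdz/bert-base-turkish-cased")
# Multilingual checkpoint covering 100+ languages.
multilingual = pipeline("fill-mask", model="bert-base-multilingual-cased")

for name, model in [("turkish", turkish), ("multilingual", multilingual)]:
    top5 = model(sentence, top_k=5)
    print(name, [p["token_str"] for p in top5])
```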
E. Comparison of statistical language modeling and neural
language model
Here we examine the effect of the training dataset size on
each model and the accuracy of each model by comparing the
top-1 and top-5 results.
Comparing Markov and BERT in Figure 6, we see that BERT
gives better results than the Markov chain as the dataset
size grows; in this figure we use the three datasets and show
how they affect the results.
Figure 6 BERT vs Markov Chain
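Top-1 and top-5 accuracy can be computed in the same way for both models once each returns a ranked candidate list. The following minimal sketch uses a placeholder predictor; in practice it would be replaced by the Markov backoff or the BERT masked-word prediction.

```python
def topk_accuracy(examples, predict, k=5):
    """examples: list of (context, true_word) pairs; predict(context) returns
    a list of candidate words ranked best-first."""
    hits = sum(1 for context, true_word in examples
               if true_word in predict(context)[:k])
    return hits / len(examples)

# Hypothetical usage with any ranked predictor:
examples = [(["adam", "markete"], "gitti")]
ranked = lambda context: ["gitti", "geldi", "koştu"]   # placeholder predictor
print("top-1:", topk_accuracy(examples, ranked, k=1))
print("top-5:", topk_accuracy(examples, ranked, k=5))
```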
VII. CONCLUSION
From our study and previous studies, we notice that
statistical language modeling, although it is considered an
old technique compared to BERT's deep learning model, still
gives good results.
We notice that BERT, although it is a deep learning model,
did not succeed much, because the language contains hundreds
of thousands of words; these words may be nouns or verbs
with different forms, and they can also appear in different
positions in the sentence, which gives millions of
possibilities.
So everything from statistical language modeling to neural
language models reaches approximately 30 to 40 percent
correct guesses.
Based on the graphs extracted from our study, we see that
the size of the dataset greatly affects the guessing accuracy,
so in the future a larger dataset and new techniques that
improve the computer's understanding of the language can be
used. In return, increasing the size of the dataset leads to
an increase in computation: for example, when the dataset
size was 100K, the running time was approximately 56 hours.
Assuming a dataset on the order of a million, the process
would be expected to stretch to months on current processors.
REFERENCES
[1] J. Brownlee. (2017) Gentle introduction to statistical language modeling and neural language models. [Online]. Available: https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/
[2] J. Devlin and M.-W. Chang. (2018) Open sourcing BERT: State-of-the-art pre-training for natural language processing. [Online]. Available: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[3] Y. Seth. (2019) BERT explained. [Online]. Available: https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/