Phases of Syntax Analysis
1. Identify the words: Lexical Analysis.
Converts a stream of characters (input program) into a stream of tokens.
Also called Scanning or Tokenizing.
2. Identify the sentences: Parsing.
Derive the structure of sentences: construct parse trees from a stream of tokens.
Lexical Analysis
Convert a stream of characters into a stream of tokens.
• Simplicity: Conventions about “words” are often different from conventions about “sentences”.
• Efficiency: Word identification problem has a much more efficient solution than sentence identification problem.
• Portability: Character set, special characters, device features.
Terminology
• Token: Name given to a family of words.
e.g., integer constant
• Lexeme: Actual sequence of characters representing a word.
e.g., 32894
• Pattern: Notation used to identify the set of lexemes represented by a token.
e.g., [0-9]+
A few more examples:
Token              Sample Lexemes     Pattern
while              while              while
integer constant   32894, -1093, 0    [+-]?[0-9]+
identifier         buffer, size       [a-zA-Z]+
Patterns
How do we compactly represent the set of all lexemes corresponding to a token?
For instance:
The token integer constant represents the set of all integers: that is, all sequences of digits (0–9), preceded by an optional
sign (+ or −).
Obviously, we cannot simply enumerate all lexemes.
Use Regular Expressions.
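As a rough illustration, the integer-constant pattern can be tried out with Python's re module. The names INTEGER_CONSTANT and is_integer_constant are assumptions made for this sketch and do not come from the slides.

```python
import re

# Pattern for the token "integer constant": an optional sign, then digits.
INTEGER_CONSTANT = re.compile(r"[+-]?[0-9]+")

def is_integer_constant(lexeme):
    """Return True if the whole lexeme belongs to the set described by the pattern."""
    return INTEGER_CONSTANT.fullmatch(lexeme) is not None
```

A single short pattern stands for an infinite set of lexemes, which is the point of using regular expressions as the pattern language.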
Regular Expressions
Notation to represent (potentially) infinite sets of strings over alphabet Σ.
• a: stands for the set {a} that contains a single string a.
• a | b: stands for the set {a, b} that contains the two strings a and b.
⊲ Analogous to Union.
• ab: stands for the set {ab} that contains a single string ab.
⊲ Analogous to Product.
⊲ (a|b)(a|b): stands for the set {aa, ab, ba, bb}.
• a∗: stands for the set {ε, a, aa, aaa, . . .} that contains all strings of zero or more a’s.
⊲ Analogous to closure of the product operation.
Regular Expressions
Examples of Regular Expressions over {a, b}:
• (a|b)∗: Set of strings with zero or more a’s and zero or more b’s:
{ε, a, b, aa, ab, ba, bb, aaa, aab, . . .}
• (a∗b∗): Set of strings with zero or more a’s and zero or more b’s such that all a’s occur before any b:
{ε, a, b, aa, ab, bb, aaa, aab, abb, . . .}
• (a∗b∗)∗: Set of strings with zero or more a’s and zero or more b’s:
{ε, a, b, aa, ab, ba, bb, aaa, aab, . . .}
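The three example expressions can be compared by brute force over short strings. This is only a sketch: the helper name language and the length bound of 4 are assumptions, and Python's re syntax stands in for the slide notation.

```python
import re
from itertools import product

def language(pattern, max_len):
    """All strings over {a, b} of length <= max_len that the pattern matches in full."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product("ab", repeat=n)
        if compiled.fullmatch("".join(chars))
    }

all_strings = language(r"(a|b)*", 4)   # every string over {a, b}
a_then_b = language(r"a*b*", 4)        # all a's before any b
nested = language(r"(a*b*)*", 4)       # same set as (a|b)*
```

Within the bound, all_strings and nested are equal, while a_then_b is a strict subset (it lacks ba, for example).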
Language of Regular Expressions
Let R be the set of all regular expressions over Σ. Then,
• Empty String: ε ∈ R
• Unit Strings: α ∈ Σ ⇒ α ∈ R
• Concatenation: r1, r2 ∈ R ⇒ r1r2 ∈ R
• Alternative: r1, r2 ∈ R ⇒ (r1 | r2) ∈ R
• Kleene Closure: r ∈ R ⇒ r∗ ∈ R
Regular Expressions
Example: (a | b)∗
L0 = {ε}
L1 = L0 · {a, b}
   = {ε} · {a, b}
   = {a, b}
L2 = L1 · {a, b}
   = {a, b} · {a, b}
   = {aa, ab, ba, bb}
L3 = L2 · {a, b}
...
L = ⋃_{i=0}^{∞} Li = {ε, a, b, aa, ab, ba, bb, . . .}
Semantics of Regular Expressions
Semantic Function L : Maps regular expressions to sets of strings.
L(ε) = {ε}
L(α) = {α} (α ∈ Σ)
L(r1 | r2) = L(r1) ∪ L(r2)
L(r1 r2) = L(r1) · L(r2)
L(r∗) = {ε} ∪ (L(r) · L(r∗))
Computing the Semantics
L(a) = {a}
L(a | b) = L(a) ∪ L(b)
= {a} ∪ {b}
= {a, b}
L(ab) = L(a) · L(b)
= {a} · {b}
= {ab}
L((a | b)(a | b)) = L(a | b) · L(a | b)
= {a, b} · {a, b}
= {aa, ab, ba, bb}
Computing the Semantics of Closure
Example: L((a | b)∗) = {ε} ∪ (L(a | b) · L((a | b)∗))
L0 = {ε} (base case)
L1 = {ε} ∪ ({a, b} · L0)
   = {ε} ∪ ({a, b} · {ε})
   = {ε, a, b}
L2 = {ε} ∪ ({a, b} · L1)
   = {ε} ∪ ({a, b} · {ε, a, b})
   = {ε, a, b, aa, ab, ba, bb}
...
L((a | b)∗) = L∞ = {ε, a, b, aa, ab, ba, bb, . . .}
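The iteration L0, L1, L2, ... can be mimicked on finite sets by cutting off strings beyond a chosen length. The following is a minimal sketch under that assumption; the function names concat and star and the max_len bound are illustrative and not part of the slides.

```python
def concat(xs, ys, max_len):
    """Product of two string sets, keeping only strings of length <= max_len."""
    return {x + y for x in xs for y in ys if len(x + y) <= max_len}

def star(base, max_len):
    """Iterate L_{i+1} = {ε} ∪ (base · L_i) from L_0 = {ε} until nothing changes."""
    current = {""}                       # L_0 = {ε}; "" plays the role of ε
    while True:
        nxt = {""} | concat(base, current, max_len)
        if nxt == current:               # fixed point of the bounded iteration
            return current
        current = nxt

a_or_b = {"a"} | {"b"}                   # L(a | b) = L(a) ∪ L(b)
closure = star(a_or_b, 2)                # strings of L((a | b)*) up to length 2
```

With max_len = 2 the loop passes through exactly the sets L0, L1, and L2 shown above and then stops, because longer strings are discarded.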
Another Example
L((a∗b∗)∗):
L(a∗) = {ε, a, aa, . . .}
L(b∗) = {ε, b, bb, . . .}
L(a∗b∗) = {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
L((a∗b∗)∗) = {ε}
  ∪ {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
  ∪ {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .}
  ∪ . . .
Regular Definitions
Assign “names” to regular expressions.
For example,
digit −→ 0 | 1 | · · · | 9
natural −→ digit digit∗
Shorthands:
• a+: Set of strings with one or more occurrences of a.
• a?: Set of strings with zero or one occurrence of a.
Example:
integer −→ (+|−)? digit+
Regular Definitions: Examples
float −→ integer . fraction
integer −→ (+|−)? no_leading_zero
no_leading_zero −→ (nonzero_digit digit∗) | 0
fraction −→ no_trailing_zero exponent?
no_trailing_zero −→ (digit∗ nonzero_digit) | 0
exponent −→ (E | e) integer
digit −→ 0 | 1 | · · · | 9
nonzero_digit −→ 1 | 2 | · · · | 9
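Regular definitions compose by textual substitution, which can be sketched with Python string formatting. The variable names mirror the definitions above; the helper is_float and the use of ASCII + and - for the sign are assumptions of this sketch.

```python
import re

# Each definition is built from earlier ones, mirroring the regular definitions.
digit = r"[0-9]"
nonzero_digit = r"[1-9]"
no_leading_zero = rf"(?:{nonzero_digit}{digit}*|0)"
no_trailing_zero = rf"(?:{digit}*{nonzero_digit}|0)"
integer = rf"[+-]?{no_leading_zero}"
exponent = rf"(?:[Ee]{integer})"
fraction = rf"{no_trailing_zero}{exponent}?"
float_re = re.compile(rf"{integer}\.{fraction}")

def is_float(text):
    """Return True if the whole text matches the float definition."""
    return float_re.fullmatch(text) is not None
```

Under these definitions 3.14 and -0.5e10 are floats, while 03.14 (leading zero) and 3.140 (trailing zero) are not.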
Regular Definitions and Lexical Analysis
Regular Expressions and Definitions specify sets of strings over an input alphabet.
• They can hence be used to specify the set of lexemes associated with a token.
⊲ Used as the pattern language
How do we decide whether an input string belongs to the set of strings specified by a regular expression?
Using Regular Definitions for Lexical Analysis
Q: Is ababbaabbb in L((a∗b∗)∗)?
A: Hm. Well. Let’s see.
L((a∗b∗)∗) = {ε}
  ∪ {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
  ∪ {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .}
  ∪ . . .
= ???
Recognizers
Construct automata that recognize strings belonging to a language.
• Finite State Automata ⇒ Regular Languages
• Push Down Automata ⇒ Context-free Languages
⊲ The stack is used to maintain a counter, but only one counter can grow arbitrarily high.
Recognizing Finite Sets of Strings
Identifying words from a small, finite, fixed vocabulary is straightforward.
For instance, consider a stack machine with push, pop, and add operations with two constants: 0 and 1.
We can use the automaton:
[Figure: automaton that reads the input character by character, with final states labeled push, pop, add, and integer_constant (reached on 0 or 1).]
Finite State Automata
Represented by a labeled directed graph.
• A finite set of states (vertices).
• Transitions between states (edges).
• Labels on transitions are drawn from Σ ∪ {ε}.
• One distinguished start state.
• One or more distinguished final states.
Finite State Automata: An Example
Consider the Regular Expression (a | b)∗a(a | b).
L((a | b)∗a(a | b)) = {aa, ab, aaa, aab, baa, bab, aaaa, aaab, abaa, abab, baaa, . . .}.
The following automaton determines whether an input string belongs to L((a | b)∗a(a | b)):
[Figure: automaton with states 1, 2, 3. State 1 is the start state and loops on a and b; 1 → 2 on a; 2 → 3 on a or b; state 3 is final.]
Determinism
(a | b)∗a(a | b):
Nondeterministic (NFA):
[Figure: NFA with three states (1, 2, 3).]
Deterministic (DFA):
[Figure: DFA with four states (1, 2, 3, 4).]
Acceptance Criterion
A finite state automaton (NFA or DFA) accepts an input string x
. . . if beginning from the start state
. . . we can trace some path through the automaton
. . . such that the sequence of edge labels spells x
. . . and end in a final state.
Recognition with an NFA
Is abab ∈ L((a | b)∗a(a | b))?
[Figure: NFA with states 1, 2, 3.]
Input:         a    b    a    b
Path 1:   1    1    1    1    1
Path 2:   1    1    1    2    3    Accept
Path 3:   1    2    3    ⊥    ⊥
Accept
Recognition with a DFA
Is abab ∈ L((a | b)∗a(a | b))?
[Figure: DFA with states 1, 2, 3, 4.]
NFA vs. DFA
For every NFA, there is a DFA that accepts the same set of strings.
• NFA may have transitions labeled by ε.
(Spontaneous transitions)
• All transition labels in a DFA belong to Σ.
• For some string x, there may be many accepting paths in an NFA.
• For every string x, there is at most one path through a DFA, so an accepted string has exactly one accepting path.
• Usually, an input string can be recognized faster with a DFA.
• NFAs are typically smaller than the corresponding DFAs.
Regular Expressions to NFA
Thompson’s Construction: For every regular expression r, derive an NFA N(r) with unique start and final states.
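[Figure: N(ε): a single ε-transition from the start state to the final state.]
[Figure: N(α) for α ∈ Σ: a single α-transition from the start state to the final state.]
[Figure: N(r1 | r2): N(r1) and N(r2) in parallel, joined to a new start state and a new final state by four ε-transitions.]
Regular Expressions to NFA (contd.)
[Figure: N(r1r2): N(r1) followed by N(r2), linked by ε-transitions.]
[Figure: N(r∗): N(r) with four ε-transitions that allow skipping or repeating it.]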
Example
(a | b)∗a(a | b):
[Figure: NFA obtained by applying Thompson’s construction to (a | b)∗a(a | b).]
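Thompson's construction can be sketched in a few functions. This is an illustrative variant, not the slides' exact diagrams: an NFA is represented as a tuple (start, final, edges), ε is encoded as None, concatenation uses a single ε-edge, and all function names are assumptions of this sketch.

```python
from itertools import count

EPS = None                       # label used for ε-transitions
_ids = count()                   # source of fresh state numbers

def symbol(ch):
    """N(α): start --α--> final. Returns (start, final, edges)."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, ch, f)]

def alt(n1, n2):
    """N(r1 | r2): new start and final states joined by four ε-edges."""
    s, f = next(_ids), next(_ids)
    edges = n1[2] + n2[2] + [(s, EPS, n1[0]), (s, EPS, n2[0]),
                             (n1[1], EPS, f), (n2[1], EPS, f)]
    return s, f, edges

def cat(n1, n2):
    """N(r1 r2): link the final state of N(r1) to the start state of N(r2)."""
    return n1[0], n2[1], n1[2] + n2[2] + [(n1[1], EPS, n2[0])]

def star(n):
    """N(r*): ε-edges allow skipping N(r) entirely or repeating it."""
    s, f = next(_ids), next(_ids)
    edges = n[2] + [(s, EPS, n[0]), (s, EPS, f),
                    (n[1], EPS, n[0]), (n[1], EPS, f)]
    return s, f, edges

def eps_closure(states, edges):
    """All states reachable from the given states by zero or more ε-edges."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def accepts(nfa, text):
    """Simulate the NFA by tracking the set of states reachable on each prefix."""
    start, final, edges = nfa
    current = eps_closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = eps_closure(moved, edges)
    return final in current

# (a | b)* a (a | b)
nfa = cat(cat(star(alt(symbol("a"), symbol("b"))), symbol("a")),
          alt(symbol("a"), symbol("b")))
```

Each constructor returns an NFA with exactly one start and one final state, which is what lets the pieces be combined mechanically.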
Recognition with an NFA
Is abab ∈ L((a | b)∗a(a | b))?
[Figure: NFA with states 1, 2, 3.]
Input:             a        b        a        b
Path 1:      1     1        1        1        1
Path 2:      1     1        1        2        3       Accept
Path 3:      1     2        3        ⊥        ⊥
All Paths:  {1}  {1, 2}   {1, 3}   {1, 2}   {1, 3}    Accept
Recognition with an NFA (contd.)
Is aaab ∈ L((a | b)∗a(a | b))?
[Figure: NFA with states 1, 2, 3.]
Input:             a         a            a          b
Path 1:      1     1         1            1          1
Path 2:      1     1         1            2          3       Accept
Path 3:      1     1         2            3          ⊥
Path 4:      1     2         3            ⊥          ⊥
All Paths:  {1}  {1, 2}  {1, 2, 3}    {1, 2, 3}    {1, 3}    Accept
Recognition with an NFA (contd.)
Is aabb ∈ L((a | b)∗a(a | b))?
[Figure: NFA with states 1, 2, 3.]
Input:             a         a          b        b
Path 1:      1     1         1          1        1
Path 2:      1     1         2          3        ⊥
Path 3:      1     2         3          ⊥        ⊥
All Paths:  {1}  {1, 2}  {1, 2, 3}   {1, 3}     {1}     REJECT
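The "All Paths" row amounts to tracking the set of states the NFA could be in after each input symbol. A minimal sketch follows, assuming the three-state NFA for (a | b)∗a(a | b) with transitions 1 → 1 on a and b, 1 → 2 on a, and 2 → 3 on a or b; the dictionary encoding and the function names are illustrative.

```python
# NFA for (a | b)* a (a | b): state 1 is the start state, state 3 is final.
# This NFA has no ε-transitions, so no ε-closure step is needed here.
NFA = {
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "a"): {3},
    (2, "b"): {3},
}
START, FINAL = 1, {3}

def trace(text):
    """Return the list of state sets reached after each prefix of text."""
    current = {START}
    history = [current]
    for ch in text:
        # Follow every transition labeled ch from every current state.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
        history.append(current)
    return history

def accepts(text):
    """Accept if some final state is among the states reached at the end."""
    return bool(trace(text)[-1] & FINAL)
```

For example, trace("aabb") yields {1}, {1, 2}, {1, 2, 3}, {1, 3}, {1}, and the input is rejected because the last set does not contain state 3.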
Converting NFA to DFA
Subset construction
Given a set S of NFA states,
• compute Sε = ε-closure(S): Sε is the set of all NFA states reachable by zero or more ε-transitions from S.
• compute Sα = goto(S, α):
– S′ is the set of all NFA states reachable from S by taking a transition labeled α.
– Sα = ε-closure(S′).
Converting NFA to DFA (contd).
Each state in DFA corresponds to a set of states in NFA.
Start state of DFA = ε-closure(start state of NFA).
From a state s in DFA that corresponds to a set of states S in NFA:
add a transition labeled α to state s′ that corresponds to a non-empty S′ in NFA, such that S′ = goto(S, α).
s is a final state of DFA ⇔ S contains a final state of NFA.
NFA → DFA: An Example
[Figure: NFA with states 1, 2, 3.]
ε-closure({1}) = {1}
goto({1}, a) = {1, 2}
goto({1}, b) = {1}
goto({1, 2}, a) = {1, 2, 3}
goto({1, 2}, b) = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
...
NFA → DFA: An Example (contd.)
ε-closure({1}) = {1}
goto({1}, a) = {1, 2}
goto({1}, b) = {1}
goto({1, 2}, a) = {1, 2, 3}
goto({1, 2}, b) = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
goto({1, 2, 3}, b) = {1, 3}
goto({1, 3}, a) = {1, 2}
goto({1, 3}, b) = {1}
NFA → DFA: An Example (contd.)
[Figure: resulting DFA with states {1}, {1,2}, {1,3}, {1,2,3} and the transitions computed above; {1} is the start state, and {1,3} and {1,2,3} are final because they contain NFA state 3.]
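The subset construction can be sketched as a worklist algorithm. The NFA encoding below (a dictionary from (state, label) to a set of states, with None standing for ε) and all function names are assumptions of this sketch; the NFA is the three-state automaton for (a | b)∗a(a | b).

```python
from collections import deque

# Edges map (state, label) to a set of states. A label of None would denote
# an ε-transition; this particular NFA has none.
NFA = {(1, "a"): {1, 2}, (1, "b"): {1}, (2, "a"): {3}, (2, "b"): {3}}
NFA_START, NFA_FINAL, ALPHABET = 1, {3}, "ab"

def eps_closure(states):
    """States reachable from the given set by zero or more ε-transitions."""
    todo, seen = list(states), set(states)
    while todo:
        q = todo.pop()
        for r in NFA.get((q, None), set()) - seen:
            seen.add(r)
            todo.append(r)
    return frozenset(seen)

def goto(states, label):
    """ε-closure of the states reachable by one transition labeled label."""
    moved = set()
    for q in states:
        moved |= NFA.get((q, label), set())
    return eps_closure(moved)

def subset_construction():
    """Return (start, transition table, final states) of the equivalent DFA."""
    start = eps_closure({NFA_START})
    table, todo, seen = {}, deque([start]), {start}
    while todo:
        s = todo.popleft()
        for label in ALPHABET:
            target = goto(s, label)
            if not target:           # only non-empty sets become DFA states
                continue
            table[(s, label)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {s for s in seen if s & NFA_FINAL}
    return start, table, finals
```

Calling subset_construction() on this NFA produces four DFA states, {1}, {1, 2}, {1, 2, 3}, and {1, 3}, of which the two containing NFA state 3 are final.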
NFA vs. DFA
R = Size of Regular Expression
N = Length of Input String
                    NFA     DFA
Size of Automaton   O(R)    O(2^R)
Lexical Analysis
• Regular Expressions and Definitions are used to specify the set of strings (lexemes) corresponding to a token.
• An automaton (DFA/NFA) is built from the above specifications.
• Each final state is associated with an action: emit the corresponding token.
Specifying Lexical Analysis
Consider a recognizer for integers (a sequence of digits) and floats (two sequences of digits separated by a decimal point).
[0-9]+ { emit(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+ { emit(FLOAT_CONSTANT); }
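[Figure: NFA combining both patterns: ε-transitions from a common start state lead to one branch for [0-9]+ (final state INTEGER_CONSTANT) and one branch for [0-9]+"."[0-9]+ (final state FLOAT_CONSTANT).]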
Lex
Tool for building lexical analyzers.
Input: lexical specifications (.l file)
Output: C function (yylex) that returns a token on each invocation.
%%
[0-9]+ { return(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+ { return(FLOAT_CONSTANT); }
Tokens are simply integers (#define’s).
Lex Specifications
%{
C header statements for inclusion
%}
Regular Definitions e.g.:
digit [0-9]
%%
Token Specifications e.g.:
{digit}+ { return(INTEGER_CONSTANT); }
%%
Support functions in C
Regular Expressions in Lex
• Range: [0-7]: Digits from 0 through 7 (inclusive)
[a-nx-zA-Q]: Letters a through n, x through z, and A through Q.
• Exception: [^/]: Any character other than /.
• Definition: {digit}: Use the previously specified regular definition digit.
• Special characters: Connectives of regular expression, convenience features.
e.g.: | * ^
Special Characters in Lex
| * + ? ( )   Same as in regular expressions
[ ]           Enclose ranges and exceptions
{ }           Enclose “names” of regular definitions
^             Used to negate a specified range (in Exception)
.             Match any single character except newline
\             Escape the next character
\n, \t        Newline and Tab
For literal matching, enclose special characters in double quotes ("), e.g.: "*"
Or use \ to escape, e.g.: \"
Examples
for                                     Sequence of f, o, r
"||"                                    C-style OR operator (two vertical bars)
.*                                      Sequence of non-newline characters
[^*/]+                                  Sequence of characters except * and /
\"[^"]*\"                               Sequence of non-quote characters beginning and ending with a quote
({letter}|"_")({letter}|{digit}|"_")*   C-style identifiers
A Complete Example
%{
#include <stdio.h>
#include "tokens.h"
%}
digit [0-9]
hexdigit [0-9a-f]
%%
"+" { return(PLUS); }
"-" { return(MINUS); }
{digit}+ { return(INTEGER_CONSTANT); }
{digit}+"."{digit}+ { return(FLOAT_CONSTANT); }
. { return(SYNTAX_ERROR); }
%%
Actions
Actions are attached to final states.
• Distinguish the different final states.
• Can be used to set attribute values.
• Fragment of C code (blocks enclosed by ‘{’ and ‘}’).
Attributes
Additional information about a token’s lexeme.
• Stored in variable yylval
• Type of attributes (usually a union) specified by YYSTYPE
• Additional variables:
– yytext: Lexeme (actual text string)
– yyleng: Length of the string in yytext
– yylineno: Current line number (number of ‘\n’ seen thus far)
∗ enabled by %option yylineno
Priority of matching
What if an input string matches more than one pattern?
"if" { return(TOKEN_IF); }
{letter}+ { return(TOKEN_ID); }
"while" { return(TOKEN_WHILE); }
• A pattern that matches the longest string is chosen.
Example: ifx is matched as an identifier, not as the keyword if followed by x.
• Of patterns that match strings of same length, the first (from the top of file) is chosen.
Example: with the rules above, while is matched as an identifier, not the keyword while, because the {letter}+ rule comes first.
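The two disambiguation rules (longest match, then earliest rule) can be sketched as follows. The RULES list copies the three rules from the slide, with [a-zA-Z]+ standing in for {letter}+; the function next_token is a hypothetical helper, not part of lex.

```python
import re

# Rules in file order; an earlier rule wins when match lengths are equal.
RULES = [
    ("TOKEN_IF", re.compile(r"if")),
    ("TOKEN_ID", re.compile(r"[a-zA-Z]+")),
    ("TOKEN_WHILE", re.compile(r"while")),
]

def next_token(text, pos=0):
    """Return (token, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A later rule replaces the current best only with a strictly longer match.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

With this rule order, next_token("while") returns TOKEN_ID: both the identifier rule and the "while" rule match five characters, and the identifier rule is listed first. Moving the keyword rules above the identifier rule is the usual remedy.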
Constructing Scanners using (f)lex
• Scanner specifications: specifications.l
specifications.l −−(f)lex−→ lex.yy.c
• Generated scanner in lex.yy.c
lex.yy.c −−(g)cc−→ executable
– yywrap(): hook for signalling end of file.
– Use -lfl (flex) or -ll (lex) flags at link time to include default function yywrap() that always returns 1.
Implementing a Scanner
transition : state × Σ → state
algorithm scanner() {
    current_state = start_state;
    while (1) {
        c = getc(); /* on end of file, ... */
        if defined(transition(current_state, c))
            current_state = transition(current_state, c);
        else
            return current_state; /* state reached when no transition is defined */
    }
}
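The scanner loop can be made concrete for the integer/float recognizer. This is a sketch under assumptions: the state names START, INT, DOT, FLOAT are invented here, and remembering the last accepting state (so that 7. still yields the integer 7) is an addition not spelled out in the pseudocode.

```python
# DFA for [0-9]+ and [0-9]+"."[0-9]+; the state names are made up for this sketch.
START, INT, DOT, FLOAT = range(4)
ACCEPTING = {INT: "INTEGER_CONSTANT", FLOAT: "FLOAT_CONSTANT"}

def transition(state, ch):
    """Return the next state, or None when no transition is defined."""
    if ch in "0123456789":
        return {START: INT, INT: INT, DOT: FLOAT, FLOAT: FLOAT}[state]
    if ch == "." and state == INT:
        return DOT
    return None

def scan(text, pos=0):
    """Run the DFA from pos; return (token, lexeme) for the longest accepted prefix."""
    state, last, i = START, None, pos
    while i < len(text):
        nxt = transition(state, text[i])
        if nxt is None:              # no transition defined: stop scanning
            break
        state, i = nxt, i + 1
        if state in ACCEPTING:       # remember the most recent accepting state
            last = (ACCEPTING[state], text[pos:i])
    return last
```

For example, scan("3.14)") returns the float 3.14, while scan("7.") falls back to the integer 7 because the DOT state is not accepting.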
Implementing a Scanner (contd.)
Implementing the transition function:
• Simplest: 2-D array.
Space inefficient.
• Traditionally compressed using row/column equivalence. (default on (f)lex)
Good space-time tradeoff.
• Further table compression using various techniques:
– Example: RDM (Row Displacement Method):
Store rows in overlapping manner using two 1-D arrays.
Smaller tables, but longer access times.
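Column equivalence can be illustrated on a small transition table: input symbols whose columns are identical share one column in the compressed table. This is a simplified sketch of the idea, not the scheme (f)lex actually generates; compress_columns, the table layout, and the use of -1 for "no transition" are assumptions.

```python
def compress_columns(table):
    """table[state][symbol] gives the next state.

    Returns (class_of, small) such that
    small[state][class_of[symbol]] == table[state][symbol].
    """
    class_of, columns = [], {}
    for sym in range(len(table[0])):
        column = tuple(row[sym] for row in table)
        # Identical columns are assigned the same equivalence class.
        class_of.append(columns.setdefault(column, len(columns)))
    small = [[None] * len(columns) for _ in table]
    for sym, cls in enumerate(class_of):
        for state, row in enumerate(table):
            small[state][cls] = row[sym]
    return class_of, small

# Integer/float DFA over 12 symbols: digits 0-9, ".", and "anything else".
# States: 0 = start, 1 = integer, 2 = after ".", 3 = float; -1 = no transition.
table = [
    [1] * 10 + [-1, -1],
    [1] * 10 + [2, -1],
    [3] * 10 + [-1, -1],
    [3] * 10 + [-1, -1],
]
class_of, small = compress_columns(table)
```

Here the ten digit columns collapse into one class, so the 4 × 12 table shrinks to 4 × 3 at the cost of one extra lookup in class_of per input character.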
Lexical Analysis: A Summary
Convert a stream of characters into a stream of tokens.
• Make rest of compiler independent of character set
• Strip off comments
• Recognize line numbers
• Ignore white space characters
• Process macros (definitions and uses)
• Interface with symbol (name) table.

Lex analysis

  • 1. Phases of Syntax Analysis 1. Identify the words: Lexical Analysis. Converts a stream of characters (input program) into a stream of tokens. Also called Scanning or Tokenizing. 2. Identify the sentences: Parsing. Derive the structure of sentences: construct parse trees from a stream of tokens. Lexical Analysis Convert a stream of characters into a stream of tokens. • Simplicity: Conventions about “words” are often different from conventions about “sentences”. • Efficiency: Word identification problem has a much more efficient solution than sentence identification problem. • Portability: Character set, special characters, device features. Terminology • Token: Name given to a family of words. e.g., integer constant • Lexeme: Actual sequence of characters representing a word. e.g., 32894 • Pattern: Notation used to identify the set of lexemes represented by a token. e.g., [0 − 9]+ Terminology A few more examples: Token Sample Lexemes Pattern while while while integer constant 32894, -1093, 0 [0-9]+ identifier buffer size [a-zA-Z]+ Patterns How do we compactly represent the set of all lexemes corresponding to a token? For instance: The token integer constant represents the set of all integers: that is, all sequences of digits (0–9), preceded by an optional sign (+ or −). Obviously, we cannot simply enumerate all lexemes. Use Regular Expressions. Regular Expressions Notation to represent (potentially) infinite sets of strings over alphabet Σ. • a: stands for the set {a} that contains a single string a.
  • 2. ⊲ Analogous to Union. • ab: stands for the set {ab} that contains a single string ab. ⊲ Analogous to Product. ⊲ (a|b)(a|b): stands for the set {aa, ab, ba, bb}. • a∗ : stands for the set {ǫ, a, aa, aaa, . . .} that contains all strings of zero or more a’s. ⊲ Analogous to closure of the product operation. Regular Expressions Examples of Regular Expressions over {a, b}: • (a|b)∗ : Set of strings with zero or more a’s and zero or more b’s: {ǫ, a, b, aa, ab, ba, bb, aaa, aab, . . .} • (a∗ b∗ ): Set of strings with zero or more a’s and zero or more b’s such that all a’s occur before any b: {ǫ, a, b, aa, ab, bb, aaa, aab, abb, . . .} • (a∗ b∗ )∗ : Set of strings with zero or more a’s and zero or more b’s: {ǫ, a, b, aa, ab, ba, bb, aaa, aab, . . .} Language of Regular Expressions Let R be the set of all regular expressions over Σ. Then, • Empty String: ǫ ∈ R • Unit Strings: α ∈ Σ ⇒ α ∈ R • Concatenation: r1, r2 ∈ R ⇒ r1r2 ∈ R • Alternative: r1, r2 ∈ R ⇒ (r1 | r2) ∈ R • Kleene Closure: r ∈ R ⇒ r∗ ∈ R Regular Expressions Example: (a | b)∗ L0 = {ǫ} L1 = L0 · {a, b} = {ǫ} · {a, b} = {a, b} L2 = L1 · {a, b} = {a, b} · {a, b} = {aa, ab, ba, bb} L3 = L2 · {a, b} ... L = ∞ i=0 Li = {ǫ, a, b, aa, ab, ba, bb, . . .} Semantics of Regular Expressions
  • 3. Semantic Function L : Maps regular expressions to sets of strings. L(ǫ) = {ǫ} L(α) = {α} (α ∈ Σ) L(r1 | r2) = L(r1) ∪ L(r2) L(r1 r2) = L(r1) · L(r2) L(r∗ ) = {ǫ} ∪ (L(r) · L(r∗ )) Computing the Semantics L(a) = {a} L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b} L(ab) = L(a) · L(b) = {a} · {b} = {ab} L((a | b)(a | b)) = L(a | b) · L(a | b) = {a, b} · {a, b} = {aa, ab, ba, bb} Computing the Semantics of Closure Example: L((a | b)∗ ) = {ǫ} ∪ (L(a | b) · L((a | b)∗ )) L0 = {ǫ} Base case L1 = {ǫ} ∪ ({a, b} · L0) = {ǫ} ∪ ({a, b} · {ǫ}) = {ǫ, a, b} L2 = {ǫ} ∪ ({a, b} · L1) = {ǫ} ∪ ({a, b} · {ǫ, a, b}) = {ǫ, a, b, aa, ab, ba, bb} ... L((a | b)∗ ) = L∞ = {ǫ, a, b, aa, ab, ba, bb, . . .} Another Example L((a∗ b∗ )∗ ) : L(a∗ ) = {ǫ, a, aa, . . .} L(b∗ ) = {ǫ, b, bb, . . .} L(a∗ b∗ ) = {ǫ, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .} L((a∗ b∗ )∗ ) = {ǫ} ∪{ǫ, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .} ∪{ǫ, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .} . . .
  • 4. Regular Definitions Assign “names” to regular expressions. For example, digit −→ 0 | 1 | · · · | 9 natural −→ digit digit∗ Shorthands: • a+ : Set of strings with one or more occurrences of a. • a? : Set of strings with zero or one occurrences of a. Example: integer −→ (+|−)? digit+ Regular Definitions: Examples float −→ integer . fraction integer −→ (+|−)? no leading zero no leading zero −→ (nonzero digit digit∗ ) | 0 fraction −→ no trailing zero exponent? no trailing zero −→ (digit∗ nonzero digit) | 0 exponent −→ (E | e) integer digit −→ 0 | 1 | · · · | 9 nonzero digit −→ 1 | 2 | · · · | 9 Regular Definitions and Lexical Analysis Regular Expressions and Definitions specify sets of strings over an input alphabet. • They can hence be used to specify the set of lexemes associated with a token. ⊲ Used as the pattern language How do we decide whether an input string belongs to the set of strings specified by a regular expression? Using Regular Definitions for Lexical Analysis Q: Is ababbaabbb in L(((a∗ b∗ )∗ )? A: Hm. Well. Let’s see. L((a∗ b∗ )∗ ) = {ǫ} ∪{ǫ, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .} ∪{ǫ, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .} ... = ??? Recognizers Construct automata that recognize strings belonging to a language. • Finite State Automata ⇒ Regular Languages
  • 5. • Push Down Automata ⇒ Context-free Languages ⊲ Stack is used to maintain counter, but only one counter can go arbitrarily high. Recognizing Finite Sets of Strings Identifying words from a small, finite, fixed vocabulary is straightforward. For instance, consider a stack machine with push, pop, and add operations with two constants: 0 and 1. We can use the automaton: s h p p 0 1 u o a d d push pop add integer_constant Finite State Automata Represented by a labeled directed graph. • A finite set of states (vertices). • Transitions between states (edges). • Labels on transitions are drawn from Σ ∪ {ǫ}. • One distinguished start state. • One or more distinguished final states. Finite State Automata: An Example Consider the Regular Expression (a | b)∗ a(a | b). L((a | b)∗ a(a | b)) = {aa, ab, aaa, aab, baa, bab, aaaa, aaab, abaa, abab, baaa, . . .}. The following automaton determines whether an input string belongs to L((a | b)∗ a(a | b): a a b b a 1 2 3 Determinism (a | b)∗ a(a | b): Nondeterministic: (NFA) a a b b a 1 2 3 Deterministic: (DFA) a a b b a a b 1 2 3 4
  • 6. Acceptance Criterion A finite state automaton (NFA or DFA) accepts an input string x . . . if beginning from the start state . . . we can trace some path through the automaton . . . such that the sequence of edge labels spells x . . . and end in a final state. Recognition with an NFA Is abab ∈ L((a | b)∗ a(a | b))? a a b b a 1 2 3 Input: a b a b Path 1: 1 1 1 1 1 Path 2: 1 1 1 2 3 Accept Path 3: 1 2 3 ⊥ ⊥ Accept Recognition with an NFA Is abab ∈ L((a | b)∗ a(a | b))? a a b b a 1 2 3 Input: a b a b Path 1: 1 1 1 1 1 Path 2: 1 1 1 2 3 Accept Path 3: 1 2 3 ⊥ ⊥ Accept Recognition with a DFA Is abab ∈ L((a | b)∗ a(a | b))? a a b b a a b b 1 2 3 4
  • 7. NFA vs. DFA For every NFA, there is a DFA that accepts the same set of strings. • NFA may have transitions labeled by ǫ. (Spontaneous transitions) • All transition labels in a DFA belong to Σ. • For some string x, there may be many accepting paths in an NFA. • For all strings x, there is one unique accepting path in a DFA. • Usually, an input string can be recognized faster with a DFA. • NFAs are typically smaller than the corresponding DFAs. Regular Expressions to NFA Thompson’s Construction: For every regular expression r, derive an NFA N(r) with unique start and final states. ǫ ε α ∈ Σ α (r1 | r2) N(r ) 1 ε ε ε ε N(r ) 2 Regular Expressions to NFA (contd.) r1r2 N(r )2 N(r )1 ε ε r∗ ε ε N(r) ε ε Example (a | b)∗ a(a | b): ε ε ε ε a b ε ε a ε ε ε ε a b ε
  • 8. Recognition with an NFA Is abab ∈ L((a | b)∗ a(a | b))? a a b b a 1 2 3 Input: a b a b Path 1: 1 1 1 1 1 Path 2: 1 1 1 2 3 Accept Path 3: 1 2 3 ⊥ ⊥ All Paths {1} {1, 2} {1, 3} {1, 2} {1, 3} Accept Recognition with an NFA (contd.) Is aaab ∈ L((a | b)∗ a(a | b))? a a b b a 1 2 3 Input: a a a b Path 1: 1 1 1 1 1 Path 2: 1 1 1 1 2 Path 3: 1 1 1 2 3 Accept Path 4: 1 1 2 3 ⊥ Path 5: 1 2 3 ⊥ ⊥ All Paths {1} {1, 2} {1, 2, 3} {1, 2, 3} {1, 2, 3} Accept Recognition with an NFA (contd.) Is aabb ∈ L((a | b)∗ a(a | b))? a a b b a 1 2 3 Input: a a a b Path 1: 1 1 1 1 1 Path 2: 1 1 2 3 ⊥ Path 3: 1 2 3 ⊥ ⊥ All Paths {1} {1, 2} {1, 2, 3} {1, 3} {1} REJECT Converting NFA to DFA Subset construction Given a set S of NFA states, • compute Sǫ = ǫ-closure(S): Sǫ is the set of all NFA states reachable by zero or more ǫ-transitions from S. • compute Sα = goto(S, α): – S′ is the set of all NFA states reachable from S by taking a transition labeled α. – Sα = ǫ-closure(S′ ). Converting NFA to DFA (contd). Each state in DFA corresponds to a set of states in NFA. Start state of DFA = ǫ-closure(start state of NFA). From a state s in DFA that corresponds to a set of states S in NFA: add a transition labeled α to state s′ that corresponds to a non-empty S′ in NFA, such that S′ = goto(S, α).
  • 9. A state s of the DFA is a final state if the corresponding set S contains a final state of the NFA.
NFA → DFA: An Example
[Figure: the NFA with states 1, 2, 3 used in the recognition examples.]
ǫ-closure({1}) = {1}
goto({1}, a) = {1, 2}
goto({1}, b) = {1}
goto({1, 2}, a) = {1, 2, 3}
goto({1, 2}, b) = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
goto({1, 2, 3}, b) = {1, 3}
goto({1, 3}, a) = {1, 2}
goto({1, 3}, b) = {1}
NFA → DFA: An Example (contd.)
[Figure: the resulting DFA with states {1}, {1,2}, {1,3} and {1,2,3}. Its transitions are the goto values listed above; {1,3} and {1,2,3} are final because they contain NFA state 3.]
NFA vs. DFA
R = Size of Regular Expression
N = Length of Input String
                     NFA     DFA
Size of Automaton    O(R)    O(2^R)
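The subset construction from the NFA → DFA example can be sketched in C. This is a minimal illustration, not a general implementation: it hard-codes the three-state NFA for (a | b)∗ a(a | b), represents a set of NFA states as a bitmask, and skips the ǫ-closure step because this particular NFA has no ǫ-transitions. All names (`nfa_delta`, `nfa_goto`, `nfa_accepts`, `build_dfa`) are hypothetical and chosen only for this example.

```c
/* NFA for (a|b)*a(a|b): states 1, 2, 3; state 3 is final.
   A set of NFA states is a bitmask; state s is stored in bit s-1. */
enum { NFA_STATES = 3, MAX_DFA = 8 };
#define BIT(s) (1u << ((s) - 1))
#define NFA_START BIT(1)
#define NFA_FINAL BIT(3)

/* Targets of a single NFA state on one input symbol ('a' or 'b'). */
static unsigned nfa_delta(int state, char c) {
    if (c != 'a' && c != 'b') return 0;      /* symbol outside the alphabet */
    switch (state) {
    case 1:  return c == 'a' ? (BIT(1) | BIT(2)) : BIT(1);
    case 2:  return BIT(3);                  /* 2 -> 3 on both a and b */
    default: return 0;                       /* state 3 has no successors */
    }
}

/* goto(S, c): union of the targets of every state in S.
   No epsilon-closure is applied because this NFA has no epsilon moves. */
unsigned nfa_goto(unsigned set, char c) {
    unsigned out = 0;
    for (int s = 1; s <= NFA_STATES; s++)
        if (set & BIT(s)) out |= nfa_delta(s, c);
    return out;
}

/* Follow all paths at once by tracking the set of reachable states. */
int nfa_accepts(const char *input) {
    unsigned set = NFA_START;
    for (; *input != '\0'; input++) set = nfa_goto(set, *input);
    return (set & NFA_FINAL) != 0;
}

/* Subset construction with a worklist. dfa_sets[i] is the NFA state set of
   DFA state i; dfa_next[i][k] is the successor index on "ab"[k], or -1 when
   the target set is empty. Returns the number of DFA states found. */
int build_dfa(unsigned dfa_sets[MAX_DFA], int dfa_next[MAX_DFA][2]) {
    int count = 0;
    dfa_sets[count++] = NFA_START;
    for (int i = 0; i < count; i++) {
        for (int k = 0; k < 2; k++) {
            unsigned target = nfa_goto(dfa_sets[i], "ab"[k]);
            int j = -1;
            if (target != 0) {
                /* Look for an existing DFA state with the same set. */
                for (j = 0; j < count && dfa_sets[j] != target; j++) {}
                if (j == count) dfa_sets[count++] = target;
            }
            dfa_next[i][k] = j;
        }
    }
    return count;
}
```

Calling `build_dfa` on this NFA discovers the four subsets {1}, {1,2}, {1,2,3} and {1,3}, and `nfa_accepts` reproduces the "All Paths" rows of the recognition examples. A DFA state is final when its mask has the `NFA_FINAL` bit set.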
  • 10. Lexical Analysis
• Regular Expressions and Definitions are used to specify the set of strings (lexemes) corresponding to a token.
• An automaton (DFA/NFA) is built from the above specifications.
• Each final state is associated with an action: emit the corresponding token.
Specifying Lexical Analysis
Consider a recognizer for integers (sequence of digits) and floats (sequence of digits separated by a decimal point).
    [0-9]+              { emit(INTEGER_CONSTANT); }
    [0-9]+"."[0-9]+     { emit(FLOAT_CONSTANT); }
[Figure: NFA combining both patterns. A start state has ǫ-transitions to an automaton for [0-9]+, whose final state emits INTEGER_CONSTANT, and to an automaton for [0-9]+"."[0-9]+, whose final state emits FLOAT_CONSTANT.]
Lex
Tool for building lexical analyzers.
Input: lexical specifications (.l file)
Output: C function (yylex) that returns a token on each invocation.
    %%
    [0-9]+              { return(INTEGER_CONSTANT); }
    [0-9]+"."[0-9]+     { return(FLOAT_CONSTANT); }
Tokens are simply integers (#define’s).
Lex Specifications
    %{
    C header statements for inclusion
    %}
    Regular Definitions      e.g.: digit [0-9]
    %%
    Token Specifications     e.g.: {digit}+ { return(INTEGER_CONSTANT); }
    %%
    Support functions in C
Regular Expressions in Lex
  • 11.
• Range: [0-7]: Integers from 0 through 7 (inclusive)
[a-nx-zA-Q]: Letters a through n, x through z and A through Q.
• Exception: [^/]: Any character other than /.
• Definition: {digit}: Use the previously specified regular definition digit.
• Special characters: Connectives of regular expressions, convenience features.
e.g.: | * ^
Special Characters in Lex
    | * + ? ( )    Same as in regular expressions
    [ ]            Enclose ranges and exceptions
    { }            Enclose “names” of regular definitions
    ^              Used to negate a specified range (in Exception)
    .              Match any single character except newline
    \              Escape the next character
    \n, \t         Newline and Tab
For literal matching, enclose special characters in double quotes ("), e.g.: "*"
Or use \ to escape, e.g.: \"
Examples
    for            Sequence of f, o, r
    "||"           C-style OR operator (two vertical bars)
    .*             Sequence of non-newline characters
    [^*/]+         Sequence of characters except * and /
    \"[^"]*\"      Sequence of non-quote characters beginning and ending with a quote
    ({letter}|"_")({letter}|{digit}|"_")*    C-style identifiers
A Complete Example
    %{
    #include <stdio.h>
    #include "tokens.h"
    %}
    digit    [0-9]
    hexdigit [0-9a-f]
    %%
    "+"                  { return(PLUS); }
    "-"                  { return(MINUS); }
    {digit}+             { return(INTEGER_CONSTANT); }
    {digit}+"."{digit}+  { return(FLOAT_CONSTANT); }
    .                    { return(SYNTAX_ERROR); }
    %%
Actions
Actions are attached to final states.
• Distinguish the different final states.
  • 12. • Can be used to set attribute values.
• Fragment of C code (blocks enclosed by ‘{’ and ‘}’).
Attributes
Additional information about a token’s lexeme.
• Stored in variable yylval
• Type of attributes (usually a union) specified by YYSTYPE
• Additional variables:
– yytext: Lexeme (actual text string)
– yyleng: Length of string in yytext
– yylineno: Current line number (number of ‘\n’ seen thus far)
    ∗ enabled by %option yylineno
Priority of matching
What if an input string matches more than one pattern?
    "if"        { return(TOKEN_IF); }
    {letter}+   { return(TOKEN_ID); }
    "while"     { return(TOKEN_WHILE); }
• A pattern that matches the longest string is chosen.
Example: ifx is matched as an identifier, not as the keyword if followed by x.
• Of patterns that match strings of the same length, the first (from the top of the file) is chosen.
Example: while is matched as an identifier, not the keyword while, because the {letter}+ rule appears before the "while" rule.
Constructing Scanners using (f)lex
• Scanner specifications: specifications.l
    specifications.l  −−(f)lex−→  lex.yy.c
• Generated scanner in lex.yy.c
    lex.yy.c  −−(g)cc−→  executable
– yywrap(): hook for signalling end of file.
– Use -lfl (flex) or -ll (lex) flags at link time to include the default function yywrap() that always returns 1.
Implementing a Scanner
    transition : state × Σ → state
    algorithm scanner() {
        current_state = start state;
        while (1) {
            c = getc(); /* on end of file, ... */
            if defined(transition(current_state, c))
                current_state = transition(current_state, c);
            else
                return current_state;
        }
    }
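The scanner loop above can be made concrete with a small table-driven sketch in C for the integer and float patterns used earlier. It goes one step beyond the pseudocode by remembering the last final state it passed through, which is one common way to obtain the longest-match behaviour described under "Priority of matching". The state names, the `char_class` helper and the `scan` function are assumptions made for this illustration; a generated (f)lex scanner is organised differently.

```c
#include <ctype.h>
#include <stddef.h>

enum token { TOK_ERROR, TOK_INTEGER, TOK_FLOAT };

/* DFA for [0-9]+ and [0-9]+"."[0-9]+ ; -1 means "no transition defined". */
enum { S_START, S_INT, S_DOT, S_FRAC, N_STATES };
enum { C_DIGIT, C_DOT, C_OTHER, N_CLASSES };

static const int transition[N_STATES][N_CLASSES] = {
    /* S_START */ { S_INT,  -1,    -1 },
    /* S_INT   */ { S_INT,  S_DOT, -1 },
    /* S_DOT   */ { S_FRAC, -1,    -1 },
    /* S_FRAC  */ { S_FRAC, -1,    -1 },
};

/* Token emitted by each state; TOK_ERROR marks non-final states. */
static const enum token action[N_STATES] = {
    TOK_ERROR, TOK_INTEGER, TOK_ERROR, TOK_FLOAT
};

/* Map a character to its column in the transition table. */
static int char_class(char c) {
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return c == '.' ? C_DOT : C_OTHER;
}

/* Scan one token from the start of input. Returns the token of the longest
   matching prefix and stores that prefix's length in *length. */
enum token scan(const char *input, size_t *length) {
    int state = S_START;
    enum token last = TOK_ERROR;
    *length = 0;
    for (size_t i = 0; input[i] != '\0'; i++) {
        int next = transition[state][char_class(input[i])];
        if (next < 0) break;                 /* transition undefined: stop */
        state = next;
        if (action[state] != TOK_ERROR) {    /* passed through a final state */
            last = action[state];
            *length = i + 1;
        }
    }
    return last;
}
```

For the input `12.x` the scanner reaches `S_DOT` but never `S_FRAC`, so it falls back to the last final state and reports an integer of length 2. Grouping characters into classes keeps the table at three columns instead of one column per character.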
  • 13. Implementing a Scanner (contd.)
Implementing the transition function:
• Simplest: 2-D array. Space inefficient.
• Traditionally compressed using row/column equivalence (default in (f)lex). Good space-time tradeoff.
• Further table compression using various techniques:
– Example: RDM (Row Displacement Method): Store rows in an overlapping manner using two 1-D arrays. Smaller tables, but longer access times.
Lexical Analysis: A Summary
Convert a stream of characters into a stream of tokens.
• Make the rest of the compiler independent of the character set
• Strip off comments
• Recognize line numbers
• Ignore white space characters
• Process macros (definitions and uses)
• Interface with the symbol (name) table.
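The Row Displacement Method mentioned under "Implementing a Scanner (contd.)" can be illustrated with a small C sketch. It shows one common formulation, assumed here rather than taken from the slides: the rows of a sparse 2-D transition table are overlaid in a `next` array, a parallel `check` array records which state owns each slot, and a small `base` array holds each row's displacement. The names `struct packed`, `pack` and `lookup`, the first-fit placement, and the table sizes are all hypothetical.

```c
enum { N_STATES = 4, N_SYMBOLS = 3, TABLE_MAX = N_STATES * N_SYMBOLS, NONE = -1 };

struct packed {
    int base[N_STATES];    /* displacement of each row into next/check */
    int next[TABLE_MAX];   /* overlapped transition entries */
    int check[TABLE_MAX];  /* state that owns each slot, NONE if unused */
    int size;              /* number of slots actually in use */
};

/* Place each row at the first displacement where none of its defined
   entries collides with a slot that is already occupied (first fit). */
void pack(const int table[N_STATES][N_SYMBOLS], struct packed *p) {
    for (int i = 0; i < TABLE_MAX; i++) p->next[i] = p->check[i] = NONE;
    p->size = 0;
    for (int s = 0; s < N_STATES; s++) {
        int d = 0;
        for (;;) {
            int fits = 1;
            for (int c = 0; c < N_SYMBOLS; c++)
                if (table[s][c] != NONE && p->check[d + c] != NONE) fits = 0;
            if (fits) break;
            d++;
        }
        p->base[s] = d;
        for (int c = 0; c < N_SYMBOLS; c++) {
            if (table[s][c] == NONE) continue;   /* undefined entries use no slot */
            p->next[d + c] = table[s][c];
            p->check[d + c] = s;
            if (d + c + 1 > p->size) p->size = d + c + 1;
        }
    }
}

/* One extra addition and one comparison per lookup: this is the longer
   access time paid for the smaller table. */
int lookup(const struct packed *p, int state, int symbol) {
    int i = p->base[state] + symbol;
    return p->check[i] == state ? p->next[i] : NONE;
}
```

Packing a four-state table with columns for digit, ".", and other characters (the integer/float DFA from the earlier slides) uses 5 slots where the dense 2-D array needs 12. The `check` comparison is what makes overlapping safe: a slot that belongs to another row is reported as an undefined transition.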