Lex analysis

Phases of Syntax Analysis
1. Identify the words: Lexical Analysis.
Converts a stream of characters (input program) into a stream of tokens.
Also called Scanning or Tokenizing.
2. Identify the sentences: Parsing.
Derive the structure of sentences: construct parse trees from a stream of tokens.
Lexical Analysis
Convert a stream of characters into a stream of tokens. Reasons for keeping this phase separate from parsing:
• Simplicity: Conventions about “words” are often different from conventions about “sentences”.
• Efficiency: Word identification problem has a much more efficient solution than sentence identification problem.
• Portability: Character set, special characters, device features.
Terminology
• Token: Name given to a family of words.
e.g., integer constant
• Lexeme: Actual sequence of characters representing a word.
e.g., 32894
• Pattern: Notation used to identify the set of lexemes represented by a token.
e.g., [0-9]+
Terminology (contd.)
A few more examples:
Token              Sample Lexemes     Pattern
while              while              while
integer constant   32894, -1093, 0    (+|-)?[0-9]+
identifier         buffer_size        [a-zA-Z_]+
Patterns
How do we compactly represent the set of all lexemes corresponding to a token?
For instance:
The token integer constant represents the set of all integers: that is, all sequences of digits (0–9), preceded by an optional sign (+ or −).
Obviously, we cannot simply enumerate all lexemes.
Use Regular Expressions.
Regular Expressions
Notation to represent (potentially) infinite sets of strings over alphabet Σ.
• a: stands for the set {a} that contains a single string a.
• a | b: stands for the set {a, b} that contains two strings, a and b.
⊲ Analogous to Union.
• ab: stands for the set {ab} that contains a single string ab.
⊲ Analogous to Product.
⊲ (a|b)(a|b): stands for the set {aa, ab, ba, bb}.
• a∗: stands for the set {ε, a, aa, aaa, . . .} that contains all strings of zero or more a’s.
⊲ Analogous to closure of the product operation.
Regular Expressions
Examples of Regular Expressions over {a, b}:
• (a|b)∗: Set of strings with zero or more a’s and zero or more b’s, in any order:
{ε, a, b, aa, ab, ba, bb, aaa, aab, . . .}
• (a∗b∗): Set of strings with zero or more a’s and zero or more b’s such that all a’s occur before any b:
{ε, a, b, aa, ab, bb, aaa, aab, abb, . . .}
• (a∗b∗)∗: Set of strings with zero or more a’s and zero or more b’s, in any order:
{ε, a, b, aa, ab, ba, bb, aaa, aab, . . .}
Language of Regular Expressions
Let R be the set of all regular expressions over Σ. Then,
• Empty String: ε ∈ R
• Unit Strings: α ∈ Σ ⇒ α ∈ R
• Concatenation: r1, r2 ∈ R ⇒ r1r2 ∈ R
• Alternative: r1, r2 ∈ R ⇒ (r1 | r2) ∈ R
• Kleene Closure: r ∈ R ⇒ r∗ ∈ R
Regular Expressions
Example: (a | b)∗
L0 = {ε}
L1 = L0 · {a, b}
   = {ε} · {a, b}
   = {a, b}
L2 = L1 · {a, b}
   = {a, b} · {a, b}
   = {aa, ab, ba, bb}
L3 = L2 · {a, b}
...
L = ⋃ (i = 0 to ∞) Li = {ε, a, b, aa, ab, ba, bb, . . .}
Semantics of Regular Expressions

Semantic Function L: Maps regular expressions to sets of strings.
L(ε) = {ε}
L(α) = {α} (α ∈ Σ)
L(r1 | r2) = L(r1) ∪ L(r2)
L(r1 r2) = L(r1) · L(r2)
L(r∗) = {ε} ∪ (L(r) · L(r∗))
Computing the Semantics
L(a) = {a}
L(a | b) = L(a) ∪ L(b)
= {a} ∪ {b}
= {a, b}
L(ab) = L(a) · L(b)
= {a} · {b}
= {ab}
L((a | b)(a | b)) = L(a | b) · L(a | b)
= {a, b} · {a, b}
= {aa, ab, ba, bb}
Computing the Semantics of Closure
Example: L((a | b)∗) = {ε} ∪ (L(a | b) · L((a | b)∗))
L0 = {ε} (base case)
L1 = {ε} ∪ ({a, b} · L0)
   = {ε} ∪ ({a, b} · {ε})
   = {ε, a, b}
L2 = {ε} ∪ ({a, b} · L1)
   = {ε} ∪ ({a, b} · {ε, a, b})
   = {ε, a, b, aa, ab, ba, bb}
...
L((a | b)∗) = L∞ = {ε, a, b, aa, ab, ba, bb, . . .}
Another Example
L((a∗b∗)∗):
L(a∗) = {ε, a, aa, . . .}
L(b∗) = {ε, b, bb, . . .}
L(a∗b∗) = {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
L((a∗b∗)∗) = {ε}
  ∪ {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
  ∪ {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .}
  ∪ . . .

Regular Definitions
Assign “names” to regular expressions.
For example,
digit → 0 | 1 | · · · | 9
natural → digit digit∗
Shorthands:
• a+: Set of strings with one or more occurrences of a.
• a?: Set of strings with zero or one occurrences of a.
Example:
integer → (+|−)? digit+
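The integer definition above can be checked by hand with a few lines of C. This is only a sketch: the function name is_integer is hypothetical, and ASCII '-' stands in for the minus sign used in the notation.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if the whole string s matches (+|-)? digit+ . */
bool is_integer(const char *s) {
    if (*s == '+' || *s == '-')          /* optional sign: (+|-)? */
        s++;
    if (!isdigit((unsigned char)*s))     /* digit+ needs at least one digit */
        return false;
    while (isdigit((unsigned char)*s))   /* consume the remaining digits */
        s++;
    return *s == '\0';                   /* nothing may follow the digits */
}
```

Each construct of the definition maps to one statement: the optional sign to an if, and digit+ to a mandatory first digit followed by a loop.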
Regular Definitions: Examples
float → integer . fraction
integer → (+|−)? no_leading_zero
no_leading_zero → (nonzero_digit digit∗) | 0
fraction → no_trailing_zero exponent?
no_trailing_zero → (digit∗ nonzero_digit) | 0
exponent → (E | e) integer
digit → 0 | 1 | · · · | 9
nonzero_digit → 1 | 2 | · · · | 9
Regular Definitions and Lexical Analysis
Regular Expressions and Definitions specify sets of strings over an input alphabet.
• They can hence be used to specify the set of lexemes associated with a token.
⊲ Used as the pattern language
How do we decide whether an input string belongs to the set of strings specified by a regular expression?
Using Regular Definitions for Lexical Analysis
Q: Is ababbaabbb in L((a∗b∗)∗)?
A: Hm. Well. Let’s see.
L((a∗b∗)∗) = {ε}
  ∪ {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, . . .}
  ∪ {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .}
  ∪ . . .
  = ???
Recognizers
Construct automata that recognize strings belonging to a language.
• Finite State Automata ⇒ Regular Languages

• Push Down Automata ⇒ Context-free Languages
⊲ Stack is used to maintain counter, but only one counter can go arbitrarily high.
Recognizing Finite Sets of Strings
Identifying words from a small, finite, fixed vocabulary is straightforward.
For instance, consider a stack machine with push, pop, and add operations with two constants: 0 and 1.
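We can use the automaton:
[Figure: automaton that spells out push, pop, and add letter by letter and reads the constants 0 and 1; its final states are labeled push, pop, add, and integer_constant.]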
Finite State Automata
Represented by a labeled directed graph.
• A finite set of states (vertices).
• Transitions between states (edges).
• Labels on transitions are drawn from Σ ∪ {ε}.
• One distinguished start state.
• One or more distinguished final states.
Finite State Automata: An Example
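Consider the Regular Expression (a | b)∗a(a | b).
L((a | b)∗a(a | b)) = {aa, ab, aaa, aab, baa, bab, aaaa, aaab, abaa, abab, baaa, . . .}.
The following automaton determines whether an input string belongs to L((a | b)∗a(a | b)):
[Figure: NFA with states 1, 2, 3. State 1 (start) loops on a and b and moves to state 2 on a; state 2 moves to state 3 on a or b; state 3 is final.]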
Determinism
(a | b)∗a(a | b):
Nondeterministic (NFA):
[Figure: NFA with states 1, 2, 3 and transitions labeled a and b.]
Deterministic (DFA):
[Figure: DFA with states 1, 2, 3, 4 and transitions labeled a and b.]

Acceptance Criterion
A finite state automaton (NFA or DFA) accepts an input string x
. . . if beginning from the start state
. . . we can trace some path through the automaton
. . . such that the sequence of edge labels spells x
. . . and end in a final state.
Recognition with an NFA
Is abab ∈ L((a | b)∗a(a | b))?
[Figure: the NFA for (a | b)∗a(a | b), with states 1, 2, 3.]
Input:  a b a b
Path 1: 1 1 1 1 1
Path 2: 1 1 1 2 3 Accept
Path 3: 1 2 3 ⊥ ⊥
Result: Accept
Recognition with a DFA
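Is the input string in L((a | b)∗a(a | b))?
[Figure: the DFA for (a | b)∗a(a | b), with states 1, 2, 3, 4 and transitions labeled a and b.]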
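Recognition with a DFA amounts to one table lookup per input character. The sketch below assumes the DFA that the subset construction later in these slides produces for (a | b)∗a(a | b). The state names S1, S12, S123, S13 are hypothetical and name the sets of NFA states they stand for; they are not the numbers 1 to 4 used in the figure.

```c
#include <stdbool.h>

/* Hypothetical state names: each DFA state stands for a set of NFA states. */
enum { S1, S12, S123, S13, NUM_STATES };

/* delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'. */
static const int delta[NUM_STATES][2] = {
    [S1]   = { S12,  S1  },
    [S12]  = { S123, S13 },
    [S123] = { S123, S13 },
    [S13]  = { S12,  S1  },
};

/* Returns true if input is in L((a|b)*a(a|b)). */
bool dfa_accepts(const char *input) {
    int state = S1;                          /* start state */
    for (; *input != '\0'; input++) {
        if (*input != 'a' && *input != 'b')
            return false;                    /* symbol outside the alphabet {a, b} */
        state = delta[state][*input == 'b']; /* exactly one successor per symbol */
    }
    return state == S123 || state == S13;    /* final: sets containing NFA state 3 */
}
```

There is never a choice to make, so the running time is proportional to the length of the input.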

NFA vs. DFA
For every NFA, there is a DFA that accepts the same set of strings.
• NFA may have transitions labeled by ε.
(Spontaneous transitions)
• All transition labels in a DFA belong to Σ.
• For some string x, there may be many accepting paths in an NFA.
• For every accepted string x, there is exactly one accepting path in a DFA.
• Usually, an input string can be recognized faster with a DFA.
• NFAs are typically smaller than the corresponding DFAs.
Regular Expressions to NFA
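Thompson’s Construction: For every regular expression r, derive an NFA N(r) with unique start and final states.
• ε: [Figure: start state with a single ε-transition to the final state.]
• α ∈ Σ: [Figure: start state with a single α-transition to the final state.]
• (r1 | r2): [Figure: new start state with ε-transitions into N(r1) and N(r2), and ε-transitions from both into a new final state.]
Regular Expressions to NFA (contd.)
• r1r2: [Figure: N(r1) followed by N(r2), connected by ε-transitions.]
• r∗: [Figure: N(r) wrapped in ε-transitions that allow skipping N(r) or repeating it.]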
Example
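(a | b)∗a(a | b):
[Figure: NFA produced by Thompson’s construction for (a | b)∗a(a | b): the fragments for (a | b)∗, a, and (a | b) connected by ε-transitions.]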

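Recognition with an NFA (contd.)
Is abab ∈ L((a | b)∗a(a | b))?
[Figure: the NFA for (a | b)∗a(a | b), with states 1, 2, 3.]
Input:  a b a b
Path 1: 1 1 1 1 1
Path 2: 1 1 1 2 3 Accept
Path 3: 1 2 3 ⊥ ⊥
All Paths: {1} {1, 2} {1, 3} {1, 2} {1, 3} Accept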
Recognition with an NFA (contd.)
Is aaab ∈ L((a | b)∗a(a | b))?
[Figure: the NFA for (a | b)∗a(a | b), with states 1, 2, 3.]
Input:  a a a b
Path 1: 1 1 1 1 1
Path 2: 1 1 1 2 3 Accept
Path 3: 1 1 2 3 ⊥
Path 4: 1 2 3 ⊥ ⊥
All Paths: {1} {1, 2} {1, 2, 3} {1, 2, 3} {1, 3} Accept
Recognition with an NFA (contd.)
Is aabb ∈ L((a | b)∗a(a | b))?
[Figure: the NFA for (a | b)∗a(a | b), with states 1, 2, 3.]
Input:  a a b b
Path 1: 1 1 1 1 1
Path 2: 1 1 2 3 ⊥
Path 3: 1 2 3 ⊥ ⊥
All Paths: {1} {1, 2} {1, 2, 3} {1, 3} {1} REJECT
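The "All Paths" row can be computed directly by tracking the set of active NFA states instead of following one path at a time. The sketch below hard-codes the three-state NFA used in these examples and stores a state set as a bit mask. The names ST1, ST2, ST3, nfa_step, and nfa_accepts are hypothetical, and no ε-closure step is needed because this NFA has no ε-transitions.

```c
#include <stdbool.h>

/* Bit i-1 set means NFA state i is in the set. */
enum { ST1 = 1u << 0, ST2 = 1u << 1, ST3 = 1u << 2 };

/* Set of states reachable from the set `states` on symbol c ('a' or 'b'). */
unsigned nfa_step(unsigned states, char c) {
    unsigned next = 0;
    if (states & ST1) {
        next |= ST1;                 /* state 1 loops on a and b */
        if (c == 'a')
            next |= ST2;             /* 1 --a--> 2 */
    }
    if (states & ST2)
        next |= ST3;                 /* 2 --a,b--> 3 */
    /* state 3 has no outgoing transitions */
    return next;
}

/* Returns true if some path through the NFA spells input and ends in state 3. */
bool nfa_accepts(const char *input) {
    unsigned states = ST1;           /* only the start state is active initially */
    for (; *input != '\0'; input++) {
        if (*input != 'a' && *input != 'b')
            return false;            /* symbol outside the alphabet {a, b} */
        states = nfa_step(states, *input);
    }
    return (states & ST3) != 0;
}
```

Each call to nfa_step produces one column of the "All Paths" row, so all paths are explored in a single left-to-right pass over the input.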
Converting NFA to DFA
Subset construction
Given a set S of NFA states,
• compute Sε = ε-closure(S): Sε is the set of all NFA states reachable by zero or more ε-transitions from S.
• compute Sα = goto(S, α):
– S′ is the set of all NFA states reachable from S by taking a transition labeled α.
– Sα = ε-closure(S′).
Converting NFA to DFA (contd).
Each state in DFA corresponds to a set of states in NFA.
Start state of DFA = ε-closure(start state of NFA).
From a state s in DFA that corresponds to a set of states S in NFA:
add a transition labeled α to state s′ that corresponds to a non-empty S′ in NFA,
such that S′ = goto(S, α).
s is a final state of DFA if and only if S contains a final state of NFA.
NFA → DFA: An Example
[Figure: the NFA for (a | b)∗a(a | b), with states 1, 2, 3.]
ε-closure({1}) = {1}
goto({1}, a) = {1, 2}
goto({1}, b) = {1}
goto({1, 2}, a) = {1, 2, 3}
goto({1, 2}, b) = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
goto({1, 2, 3}, b) = {1, 3}
goto({1, 3}, a) = {1, 2}
goto({1, 3}, b) = {1}
[Figure: the resulting DFA with states {1}, {1,2}, {1,3}, {1,2,3} and the transitions listed above. {1,3} and {1,2,3} are final because they contain NFA state 3.]
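The subset construction can be run mechanically with a worklist of state sets. The sketch below is specialised to the three-state NFA of this example and stores sets as bit masks (bit i-1 for state i). The names move_on, subset_construction, and MAX_DFA are hypothetical. ε-closure is the identity here because this NFA has no ε-transitions, and no set ever becomes empty because state 1 loops on both symbols.

```c
#include <stddef.h>

#define MAX_DFA 8   /* a 3-state NFA has at most 2^3 subsets */

/* NFA states reachable from the set S on symbol c (bit i-1 = state i). */
static unsigned move_on(unsigned S, char c) {
    unsigned T = 0;
    if (S & 1u) { T |= 1u; if (c == 'a') T |= 2u; }  /* 1 loops; 1 --a--> 2 */
    if (S & 2u) T |= 4u;                             /* 2 --a,b--> 3 */
    return T;
}

/* Fills subsets[] with the NFA state sets that become DFA states and
   trans[i][0] / trans[i][1] with the target index on 'a' / 'b'.
   Returns the number of DFA states. */
size_t subset_construction(unsigned subsets[MAX_DFA], int trans[MAX_DFA][2]) {
    size_t count = 0;
    subsets[count++] = 1u;                     /* start state: {1} */
    for (size_t i = 0; i < count; i++) {       /* entries i..count-1 are unprocessed */
        for (int k = 0; k < 2; k++) {
            unsigned T = move_on(subsets[i], k == 0 ? 'a' : 'b');
            size_t j = 0;
            while (j < count && subsets[j] != T)
                j++;                           /* look for an existing DFA state */
            if (j == count)
                subsets[count++] = T;          /* first time this set is seen */
            trans[i][k] = (int)j;
        }
    }
    return count;
}
```

For this NFA the loop stops after discovering four sets, in the order {1}, {1,2}, {1,2,3}, {1,3}.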
NFA vs. DFA
R = Size of Regular Expression
N = Length of Input String
                    NFA         DFA
Size of Automaton   O(R)        O(2^R)
Recognition Time    O(N × R)    O(N)

Lexical Analysis
• Regular Expressions and Definitions are used to specify the set of strings (lexemes) corresponding to a token.
• An automaton (DFA/NFA) is built from the above specifications.
• Each final state is associated with an action: emit the corresponding token.
Specifying Lexical Analysis
Consider a recognizer for integers (sequence of digits) and floats (sequence of digits separated by a decimal point).
[0-9]+ { emit(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+ { emit(FLOAT_CONSTANT); }
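[Figure: NFA combining both patterns. ε-transitions from the start state lead to one branch that reads [0-9]+ and ends in a final state labeled INTEGER_CONSTANT, and to another branch that reads [0-9]+ "." [0-9]+ and ends in a final state labeled FLOAT_CONSTANT.]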
Lex
Tool for building lexical analyzers.
Input: lexical specifications (.l file)
Output: C function (yylex) that returns a token on each invocation.
%%
[0-9]+ { return(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+ { return(FLOAT_CONSTANT); }
Tokens are simply integers (#define’s).
Lex Specifications
%{
C header statements for inclusion
%}
Regular Definitions e.g.:
digit [0-9]
%%
Token Specifications e.g.:
{digit}+ { return(INTEGER_CONSTANT); }
%%
Support functions in C
Regular Expressions in Lex

• Range: [0-7]: Digits from 0 through 7 (inclusive)
[a-nx-zA-Q]: Letters a thru n, x thru z and A thru Q.
• Exception: [^/]: Any character other than /.
• Definition: {digit}: Use the previously specified regular definition digit.
• Special characters: Connectives of regular expression, convenience features.
e.g.: | * ^
Special Characters in Lex
| * + ? ( )   Same as in regular expressions
[ ]           Enclose ranges and exceptions
{ }           Enclose “names” of regular definitions
^             Used to negate a specified range (in Exception)
.             Match any single character except newline
\             Escape the next character
\n, \t        Newline and Tab
For literal matching, enclose special characters in double quotes (") e.g.: "*"
Or use \ to escape. e.g.: \"
Examples
for           Sequence of f, o, r
"||"          C-style OR operator (two vert. bars)
.*            Sequence of non-newline characters
[^*/]+        Sequence of characters except * and /
\"[^"]*\"     Sequence of non-quote characters beginning and ending with a quote
({letter}|"_")({letter}|{digit}|"_")*
              C-style identifiers
A Complete Example
%{
#include <stdio.h>
#include "tokens.h"
%}
digit [0-9]
hexdigit [0-9a-f]
%%
"+" { return(PLUS); }
"-" { return(MINUS); }
{digit}+ { return(INTEGER_CONSTANT); }
{digit}+"."{digit}+ { return(FLOAT_CONSTANT); }
. { return(SYNTAX_ERROR); }
%%
Actions
Actions are attached to final states.
• Distinguish the different final states.

• Can be used to set attribute values.
• Fragment of C code (blocks enclosed by ‘{’ and ‘}’).
Attributes
Additional information about a token’s lexeme.
• Stored in variable yylval
• Type of attributes (usually a union) specified by YYSTYPE
• Additional variables:
– yytext: Lexeme (Actual text string)
– yyleng: length of string in yytext
– yylineno: Current line number (number of ‘\n’ seen thus far)
∗ enabled by %option yylineno
Priority of matching
What if an input string matches more than one pattern?
"if" { return(TOKEN_IF); }
{letter}+ { return(TOKEN_ID); }
"while" { return(TOKEN_WHILE); }
• A pattern that matches the longest string is chosen.
Example: ifx is matched as an identifier, not as the keyword if followed by x.
• Of patterns that match strings of same length, the first (from the top of file) is chosen.
Example: while is matched as an identifier, not the keyword while, because the {letter}+ rule appears before the "while" rule.
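The two priority rules can be imitated by hand to see how they interact. The sketch below is not how (f)lex is implemented, since it tries each rule separately instead of running one combined automaton. The helper names match_keyword, match_letters, and next_token are hypothetical, and {letter} is assumed to mean [a-zA-Z].

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOKEN_NONE, TOKEN_IF, TOKEN_ID, TOKEN_WHILE };

/* Length of kw if s starts with kw, otherwise 0. */
static size_t match_keyword(const char *s, const char *kw) {
    size_t n = strlen(kw);
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the longest prefix of s that matches {letter}+ (0 if none). */
static size_t match_letters(const char *s) {
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/* Applies the rules in file order: "if", {letter}+, "while".
   The longest match wins; on a tie the earliest rule wins.
   Stores the length of the chosen lexeme in *len. */
enum token next_token(const char *s, size_t *len) {
    const enum token toks[3] = { TOKEN_IF, TOKEN_ID, TOKEN_WHILE };
    size_t lens[3];
    lens[0] = match_keyword(s, "if");
    lens[1] = match_letters(s);
    lens[2] = match_keyword(s, "while");

    enum token best = TOKEN_NONE;
    *len = 0;
    for (int i = 0; i < 3; i++) {
        if (lens[i] > *len) {      /* strictly longer, so ties keep the earlier rule */
            *len = lens[i];
            best = toks[i];
        }
    }
    return best;
}
```

With this rule order the "while" rule can never win, which is why keyword rules are normally listed before the identifier rule.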
Constructing Scanners using (f)lex
• Scanner specifications: specifications.l
specifications.l −−(f)lex−→ lex.yy.c
• Generated scanner in lex.yy.c
lex.yy.c −−(g)cc−→ executable
– yywrap(): hook for signalling end of file.
– Use -lfl (flex) or -ll (lex) flags at link time to include default function yywrap() that always returns 1.
Implementing a Scanner
transition : state × Σ → state
algorithm scanner() {
    current_state = start_state;
    while (1) {
        c = getc(); /* on end of file, ... */
        if defined(transition(current_state, c))
            current_state = transition(current_state, c);
        else
            return current_state;
    }
}
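A runnable version of this loop, for the integer and float patterns used earlier, might look as follows. It is a sketch under stated assumptions: the state names, the character classes, and the scan function are hypothetical, input comes from a string rather than getc(), and the loop remembers the last final state it passed through so that an input such as 12. still yields the integer 12. That bookkeeping is an addition to the algorithm above.

```c
#include <ctype.h>
#include <stddef.h>

enum state { START, IN_INT, SEEN_DOT, IN_FRAC, NO_STATE };
enum token { NO_TOKEN, INTEGER_CONSTANT, FLOAT_CONSTANT };
enum cls   { C_DIGIT, C_DOT, C_OTHER, NUM_CLS };

/* transition : state x character class -> state; NO_STATE = undefined. */
static const enum state table[NO_STATE][NUM_CLS] = {
    [START]    = { IN_INT,  NO_STATE, NO_STATE },
    [IN_INT]   = { IN_INT,  SEEN_DOT, NO_STATE },
    [SEEN_DOT] = { IN_FRAC, NO_STATE, NO_STATE },
    [IN_FRAC]  = { IN_FRAC, NO_STATE, NO_STATE },
};

static enum cls classify(char c) {
    if (isdigit((unsigned char)c)) return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* Action attached to each final state: the token to emit. */
static enum token action(enum state s) {
    if (s == IN_INT)  return INTEGER_CONSTANT;
    if (s == IN_FRAC) return FLOAT_CONSTANT;
    return NO_TOKEN;
}

/* Scans one token from the start of input; stores the lexeme length in *len. */
enum token scan(const char *input, size_t *len) {
    enum state current = START;
    enum token last = NO_TOKEN;
    *len = 0;
    for (size_t i = 0; input[i] != '\0'; i++) {
        current = table[current][classify(input[i])];
        if (current == NO_STATE)
            break;                        /* no transition defined: stop */
        if (action(current) != NO_TOKEN) {
            last = action(current);       /* remember the last final state seen */
            *len = i + 1;
        }
    }
    return last;
}
```

Grouping characters into classes (digit, dot, other) keeps the table at three columns instead of one column per character.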

Implementing a Scanner (contd.)
Implementing the transition function:
• Simplest: 2-D array.
Space inefficient.
• Traditionally compressed using row/column equivalence. (default on (f)lex)
Good space-time tradeoff.
• Further table compression using various techniques:
– Example: RDM (Row Displacement Method):
Store rows in overlapping manner using two 1-D arrays.
Smaller tables, but longer access times.
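One common way to realise row displacement is sketched below. Rows of a sparse table are slid over each other so that their defined entries do not collide. A check array records which row owns each packed slot. The names rdm_pack, rdm_lookup, base, next, and check are hypothetical, and this variant keeps a small per-row base array in addition to the two packed arrays, so it illustrates the idea rather than any specific layout.

```c
#define ROWS 4
#define COLS 3
#define NONE (-1)              /* undefined (default) entry */
#define PACKED (ROWS * COLS)   /* worst case: rows do not overlap at all */

static int base[ROWS];         /* displacement chosen for each row */
static int next[PACKED];       /* packed table entries */
static int check[PACKED];      /* row that owns each slot, NONE if free */

/* Packs table into base/next/check with first-fit placement of rows. */
void rdm_pack(const int table[ROWS][COLS]) {
    for (int i = 0; i < PACKED; i++) {
        next[i] = NONE;
        check[i] = NONE;
    }
    for (int r = 0; r < ROWS; r++) {
        int off = 0;
        for (;; off++) {       /* smallest offset where no defined entry collides */
            int ok = 1;
            for (int c = 0; c < COLS; c++)
                if (table[r][c] != NONE && check[off + c] != NONE) {
                    ok = 0;
                    break;
                }
            if (ok)
                break;
        }
        base[r] = off;
        for (int c = 0; c < COLS; c++)
            if (table[r][c] != NONE) {
                next[off + c] = table[r][c];
                check[off + c] = r;
            }
    }
}

/* Same result as table[state][col], at the cost of an extra comparison. */
int rdm_lookup(int state, int col) {
    int i = base[state] + col;
    return check[i] == state ? next[i] : NONE;
}
```

The extra indirection through base and the comparison against check are the "longer access times" mentioned above. The saving comes from the undefined entries that no longer occupy a slot.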
Lexical Analysis: A Summary
Convert a stream of characters into a stream of tokens.
• Make rest of compiler independent of character set
• Strip off comments
• Recognize line numbers
• Ignore white space characters
• Process macros (definitions and uses)
• Interface with symbol (name) table.
