SlideShare a Scribd company logo
What The Regex_Cat
Dragged In
Kyle Hamilton
COS210 - MCCC, Prof. Bostain
December 2013
Regular Expression:
From Wikipedia, the free encyclopedia
In computing, a regular expression (abbreviated regex or regexp) is a sequence of characters
that forms a search pattern, mainly for use in pattern matching with strings, or string matching,
i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American
mathematician Stephen Kleene formalized the description of a regular language, and came into
common use with the Unix text processing utilities ed, an editor, and grep (global regular ex-
pression print), a filter.
Each character in a regular expression is either understood to be a metacharacter with its spe-
cial meaning, or a regular character with its literal meaning. Together, they can be used to iden-
tify textual material of a given pattern, or process a number of instances of it that can vary from
a precise equality to a very general similarity of the pattern. The pattern sequence itself is an
expression that is a statement in a language designed specifically to represent prescribed tar-
gets in the most concise and flexible way to direct the automation of text processing of general
text files, specific textual forms, or of random input strings.
The alphabet, Σ, will consist of all printable ASCII characters except the metacharacters.
The meatcharacters in Regex_Cat are:
[](){}*+?^$|.
What’s supported in Regex_Cat
Currently the most basic regular expression operations can be performed:
()[]*+?|
To support the POSIX Extended standard, the following features have yet to be added:
Escapes:
 Any special character preceded by an escape shall match itself.
Assertions:
'^' shall match the start of a line when used as the first character of an expression, or the first character of a sub-expression.
'$' shall match the end of a line when used as the last character of an expression, or the last character of a sub-expression.
Repetition:
a{n} Matches 'a' repeated exactly n times.
a{n,} Matches 'a' repeated n or more times.
a{n, m} Matches 'a' repeated between n and m times inclusive.
Wildcard:
The single character '.' when used outside of a character set will match any single character.
12 Character Classes:
[:alnum:], [:alpha:], [:blank:], [:cntrl:], [:graph:], [:print:], [:punct:], [:digit:], [:lower:], [:space:], [:upper:], [:xdigit:]
Ex: [:alnum:] matches Alphanumeric characters, as would: [a-zA-Z0-9] The latter form is supported in the current Regex_Cat.
What Regex_Cat does
1. Parse regex
A. Insert explicit concatenation operators. Regex_Cat uses a ‘•’ which is not part of the first
128 printable ASCII characters. Care must be taken to compile Regex_Cat with UTF-8
encoding
B. Convert expression from infix to postfix form.
C. Save the Alphabet of the regex for later.
2. Build NFA
Convert Postfix expression to NFA
3. Convert to DFA
Use Subset construction to convert from NFA to DFA
Build DFA Transition Table
4. Simulate
Process the target string using the DFA Transition Table and Alphabet
ArrayList
A self expanding array. For storing data elements as
well as iterating over said elements.
DoublyLinkedList
Similar to the ArrayList, but with the ability to iterate in
both directions.
Stack
Great for bouncing stuff around by ‘pushing’ and ‘pop-
ping’ elements from stack to stack. The Infix to Postfix
algorithm is a good example.
Directional List Graph
Used in the construction of the NFA, the graph stores
Edge data. Each Edge has a source vertex, a desti-
nation vertex, and a ‘weight’ - in this case a character
in the alphabet of the regex.
The graph is implemented as an Adjacency List is
used, as opposed to an ‘Adjacency Matrix’, because the
NFA is ‘sparse’. A sparse graph is one in which the
number of edges is much less than the possible num-
ber of edges. The NFA vertices are connected by at
most 2 edges, and in most cases, only 1 edge.
Directional Matrix Graph
The DFA on the other hand is a type of Adjacency Ma-
trix, implemented as a 2-dimentional array. This
makes the regex engine algorithm simple and fast, with
run-times of O(n). The potential trade-off is the DFA’s
memory footprint caused by ‘state explosion’ - at worst,
the DFA can contain 2^n - 1 number of states of the
NFA (where ‘n’ is the number of states).
Data Structures
Algorithms
1. Infx To Postfix
Parses regular expression
2. Thompson-McNaughton-Yamada:
Constructs NFA
3. Subset Construction:
Converts NFA to DFA
Constructs Transition Table
4. Simulation Engine
Processes the target String
Figure: A really cool visual NFA to DFA simulator:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
a*(b|cd?)+
NFA
DFA
Infix to Postfix: InfixToPostfix.java
The first step is to parse the expression into a form suitable to building the State Machine:
Since the input has no explicit concatenation operators, we first insert those. We use the ‘•’ symbol, since it is not within the ASCII set
we are supporting, hence it won’t conflict with any regular charcters. We just have to careful when compiling the code to set the JVM
to use UTF-8 encoding.
The next step is to convert the expression to ‘Polish notation’, or, Postfix form. The algorithm to convert an infix expression into a
postfix expression uses a stack. The stack is used to hold operators. The purpose of the stack is to reverse the order of the operators
in the expression. It also serves as a storage structure, since no operator can be printed until both of its operands have appeared.
We also store the Alphabet of the regex in this step, so we can use it later to build the transition table, and process the target string.
A summary of the rules:
1. Print operands as they arrive.
2. If the stack is empty or contains a left parenthesis on top, push the incoming
operator onto the stack.
3. If the incoming symbol is a left parenthesis, push it on the stack.
4. If the incoming symbol is a right parenthesis, pop the stack and print the opera-
tors until you see a left parenthesis. Discard the pair of parentheses.
5. If the incoming symbol has higher precedence than the top of the stack, push it
on the stack.
6. If the incoming symbol has equal precedence with the top of the stack, use as-
sociation. If the association is left to right, pop and print the top of the stack and
then push the incoming operator. If the association is right to left, push the in-
coming operator.
7. If the incoming symbol has lower precedence than the symbol on the top of the
stack, pop the stack and print the top operator. Then test the incoming operator
against the new top of stack.
8. At the end of the expression, pop and print all operators on the stack. (No
parentheses should remain.)
Taken from: http://csis.pace.edu/~wolf/CS122/infix-postfix.htm
Precendence
Collation-related bracket symbols [==] [::] [..]
Escaped characters 
Character set (bracket expression) []
Grouping ()
Single-character-ERE duplication * + ? {m,n}
Concatenation
Anchoring ^$
Alternation |
InfixToPostfix.java in our package:
private static final String OPERATORS = "[](){}*+?•^$|.";
private static final int[] PRECEDENCE = {6,6,5,5,4,4,4,4,4,3,2,2,1,1};
This illustration uses a + symbol for concatenation. Regex_cat uses + to denote “one or more”, so our
implementation uses a • as mentioned on the previous page.
Taken from: http://csis.pace.edu/~wolf/CS122/infix-postfix.htm
Infix to Postfix illustrated:
Finite State Machine | Finite State Automaton
Taken from: http://www.thefreedictionary.com/finite+state+machine
A model of a computational system, consisting of a set of states (including a start state), an alphabet of symbols
that serves as a set of possible inputs to the machine, and a transition function that maps each state to another
state (or to itself) for any given input symbol. The machine operates by being fed a string of symbols, and moves
through a series of states.
NFA(Nondeterministic Finite Automaton) VS. DFA(Deterministic Finite Automaton)
Taken from: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
A nondeterministic finite automaton (NFA), or nondeterministic finite state machine, is a finite state machine that (1) does
not require input symbols for state transitions and (2) is capable of transitioning to zero or two or more states for a given
start state and input symbol. This distinguishes it from a deterministic finite automaton (DFA), in which all transitions are
uniquely determined and in which an input symbol is required for all state transitions. Although NFA and DFA have distinct
definitions, all NFAs can be translated to equivalent DFAs using the subset construction algorithm,[1] i.e., constructed
DFAs and their corresponding NFAs recognize the same formal language. Like DFAs, NFAs only recognize regular lan-
guages.
An NFA, similar to a DFA, consumes a string of input symbols. For each input symbol, it transitions to a new state until all
input symbols have been consumed. Unlike a DFA, it is non-deterministic, i.e., for some state and input symbol, the next
state may be one of two or more possible states.
An NFA is represented formally by a 5-tuple, (Q, Σ, Δ, q0, F), consisting of
a finite set of states Q
a finite set of input symbols Σ ( the Alphabet of the regex )
a transition relation Δ : Q × Σ → P(Q). ( the Edges )
an initial (or start) state q0 ∈ Q
a set of states F distinguished as accepting (or final) states F ⊆ Q.
A DFA is also represented by a 5-tuple, (Q, Σ, δ, q0, F), with one difference.
The DFA has a transition function (δ : Q × Σ → Q) in place of the the transition relation Δ in the NFA.
0
1
ε
3
ε
2
a
5
ε
6
ε
8
ε
4
ε
7
b
ε
9
c
ε
0
1
a
2
b
3
c
NFA with Epsilon States Equivalent DFA
Regular Expression: [a-c]
Regular expressions and NFAs are equivalent in power: every
regular expression has an equivalent NFA (they match the same
strings) and vice versa. DFAs are also equivalent in power to
NFAs and regular expressions.
Thompson-McNaughton-Yamada NFA.java
The NFA for a regular expression is built up from partial NFAs for each subexpression, with a different construction for
each operator, as well as epsilon states. The Start State and Accepting State of each partial NFA is used to attach the next
partial NFA in the building process.
The postfix regular expression is used to build an NFA using a stack:
while(the regular expression is not exhausted)
When a symbol is encountered, an NFA corresponding to that symbol is constructed
and pushed onto the stack.
When an operator is encountered, a number of NFAs, consistent with the operator,
is popped and used to construct a new NFA using the corresponding operator.
A new Start state and a new Accept state is assigned.
That newly constructed NFA is then pushed onto the stack.
NFA construction operations
Subset Construction: NFA to DFA DFA.java
The Subset constrauction algorithm consists of the following functions:
getEclosure()calculates epslison closures of the NFA.
Each closure becomes a State T of the DFA, and is added to the list of DFA States DS.
move()assigns edges to DFA states by iterating over the e-closures.
Subset Construction:
while (there is an unmarked state T in DS){
mark T
for (each letter ‘a’ in the alphabet ){
U = getEclosure(move(T,a))
if(U is not already in DS){
add U to DS as an unmarked state
}
push transition [T,a]=U onto the dfaTransitionStack of DFA transitons
( implemented as an ArrayList of int[] = {T, a, U} )
}
}
getEclosure(S):
start with the Start State of the NFA
push all states onto the States List S
for each state in S
push state onto stack
add state to eClosure list
while(stack is not empty){
pop stack
if(the edge is an epsilon edge, and the destination edge doesn’t already exist in the eClosure){
push the destination state onto the stack
add the destination state to the eClosure list
}
}
return the eClosure list
move(S, a)
for each state s in List S
for each Edge in s in NFA
if the weight == the letter ‘a’ of the Alphabet,
add desitnation state to list
return list
Subset Construction con’t:
Construct the Transition Table from the list of DFA Transitions:
TransTable = new int[number of DFA States][length of the Alphabet];
while(!dfaTransitionStack.empty()){
int[] trans = dfaTransitionStack.pop();
TransTable [trans[0]][trans[1]] = trans[2];
}
Subset Construction con’t:
Sample Output:
Simulation Regex.java
// exact match:
private boolean eatDFA(int[][] table, String lang, ArrayList<Integer> finalStates, String target) throws IOException{
BufferedReader br = stringToBR(target);
int c;
int state = 0;
while (((c = br.read()) != -1)) {
int input = lang.indexOf((char)c);
if(input != -1){ // only continue if input char is in the language
if (table[state][input] == -1){
// unreachable state
return false;
}else if ( finalStates.contains(table[state][input]) && !br.ready()){
return true;
// Table indicates that for this state, we accept the input given
// and we reached the end of the string
}
// Advance to next state.
state = table[state][input];
}else{
break; // it’s not even in the language, break and return false
}
}
// we reached the end of the input,
return false;
}
http://csis.pace.edu/~wolf/CS122/infix-postfix.htm
http://eli.thegreenplace.net/files/docs/forays/col6.html
http://www.fing.edu.uy/inco/cursos/intropln/material/p419-thompson.pdf
http://swtch.com/~rsc/regexp/regexp1.html
http://www.regular-expressions.info/posixbrackets.html
http://xlinux.nist.gov/dads//HTML/sparsegraph.html
http://binarysculpting.com/2012/02/11/regular-expressions-how-do-they-really-work-automata-theory-for-programmers-part-1/
http://binarysculpting.com/2012/02/15/converting-dfa-to-nfa-by-subset-construction-regular-expressions-part-2/
http://www.cs.hunter.cuny.edu/~saad/courses/csci493.66/hw/hw3.pdf
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
http://web.cecs.pdx.edu/~harry/compilers/slides/LexicalPart3.pdf
http://ezekiel.vancouver.wsu.edu/~cs317/nfa/RegEx/src/regex/NFA.java
https://www.clear.rice.edu/comp412/Lectures/L06LexIII-Subset-1up.pdf
http://www.cs.uaf.edu/~cs631/notes/strings/node6.html
http://www.youtube.com/watch?v=xlrw5Y8SXjA
http://www.youtube.com/watch?v=59quXH2yUFo
http://www.youtube.com/watch?v=iafR7TUZF8I
http://www.youtube.com/watch?v=fYmi2okBG40
http://www.youtube.com/watch?v=3tdePzdkYXE
Some helpful and some not so helpful resources
~ The End ~

More Related Content

What's hot (19)

PDF
Theory of computation and automata
Prof. Dr. K. Adisesha
 
DOC
Chapter 2 2 1 1
bolovv
 
DOCX
15 practical grep command examples in linux
Teja Bheemanapally
 
PPT
String & its application
Tech_MX
 
PDF
Python unit 2 M.sc cs
KALAISELVI P
 
PDF
Theory of computation and automata
Prof. Dr. K. Adisesha
 
PPT
Scala Language Intro - Inspired by the Love Game
Antony Stubbs
 
PDF
Introduction to the theory of computation
prasadmvreddy
 
PPT
Module 11
bittudavis
 
PDF
A regex ekon16
Max Kleiner
 
PDF
Lesson 02 python keywords and identifiers
Nilimesh Halder
 
PPTX
Summary of effective modern c++ item1 2
Young Ha Kim
 
PDF
newperl5
tutorialsruby
 
PPT
Lecture 05 2017
Jesmin Akhter
 
PPTX
LISP: Type specifiers in lisp
DataminingTools Inc
 
PPTX
LISP:Predicates in lisp
DataminingTools Inc
 
PDF
Python revision tour II
Mr. Vikram Singh Slathia
 
Theory of computation and automata
Prof. Dr. K. Adisesha
 
Chapter 2 2 1 1
bolovv
 
15 practical grep command examples in linux
Teja Bheemanapally
 
String & its application
Tech_MX
 
Python unit 2 M.sc cs
KALAISELVI P
 
Theory of computation and automata
Prof. Dr. K. Adisesha
 
Scala Language Intro - Inspired by the Love Game
Antony Stubbs
 
Introduction to the theory of computation
prasadmvreddy
 
Module 11
bittudavis
 
A regex ekon16
Max Kleiner
 
Lesson 02 python keywords and identifiers
Nilimesh Halder
 
Summary of effective modern c++ item1 2
Young Ha Kim
 
newperl5
tutorialsruby
 
Lecture 05 2017
Jesmin Akhter
 
LISP: Type specifiers in lisp
DataminingTools Inc
 
LISP:Predicates in lisp
DataminingTools Inc
 
Python revision tour II
Mr. Vikram Singh Slathia
 

Similar to RegexCat (20)

PPTX
Regular Expression to Finite Automata
Archana Gopinath
 
PDF
Regular Expression to Non-Deterministic Finite Automata Converter
IRJET Journal
 
PPTX
Regular expressions
Eran Zimbler
 
PPTX
Implementation of lexical analyser
Archana Gopinath
 
PPT
finite_automata.ppt
AkhlasHashim
 
PPT
02. Chapter 3 - Lexical Analysis NLP.ppt
charvivij
 
PPT
Ch3.ppt
MDSayem35
 
PPTX
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
PPTX
Lec1.pptx
ziadk6872
 
PPTX
Regular expression
MONIRUL ISLAM
 
PPT
02. chapter 3 lexical analysis
raosir123
 
PPT
compiler Design course material chapter 2
gadisaAdamu
 
PPTX
The Theory of Finite Automata.pptx
ssuser039bf6
 
PPT
Chapter Two(1)
bolovv
 
PPTX
Unit2 Toc.pptx
viswanath kani
 
PDF
Automata_Theory_and_compiler_design_UNIT-1.pptx.pdf
TONY562
 
PPT
Lecture2 B
Hisham Elsherif
 
PDF
Lexicalanalyzer
Royalzig Luxury Furniture
 
PDF
Lexicalanalyzer
Royalzig Luxury Furniture
 
PDF
Lecture 3 RE NFA DFA
Rebaz Najeeb
 
Regular Expression to Finite Automata
Archana Gopinath
 
Regular Expression to Non-Deterministic Finite Automata Converter
IRJET Journal
 
Regular expressions
Eran Zimbler
 
Implementation of lexical analyser
Archana Gopinath
 
finite_automata.ppt
AkhlasHashim
 
02. Chapter 3 - Lexical Analysis NLP.ppt
charvivij
 
Ch3.ppt
MDSayem35
 
Compiler Design_Lexical Analysis phase.pptx
RushaliDeshmukh2
 
Lec1.pptx
ziadk6872
 
Regular expression
MONIRUL ISLAM
 
02. chapter 3 lexical analysis
raosir123
 
compiler Design course material chapter 2
gadisaAdamu
 
The Theory of Finite Automata.pptx
ssuser039bf6
 
Chapter Two(1)
bolovv
 
Unit2 Toc.pptx
viswanath kani
 
Automata_Theory_and_compiler_design_UNIT-1.pptx.pdf
TONY562
 
Lecture2 B
Hisham Elsherif
 
Lexicalanalyzer
Royalzig Luxury Furniture
 
Lexicalanalyzer
Royalzig Luxury Furniture
 
Lecture 3 RE NFA DFA
Rebaz Najeeb
 
Ad

RegexCat

  • 1. What The Regex_Cat Dragged In Kyle Hamilton COS210 - MCCC, Prof. Bostain December 2013
  • 2. Regular Expression: From Wikipedia, the free encyclopedia In computing, a regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep (global regular ex- pression print), a filter. Each character in a regular expression is either understood to be a metacharacter with its spe- cial meaning, or a regular character with its literal meaning. Together, they can be used to iden- tify textual material of a given pattern, or process a number of instances of it that can vary from a precise equality to a very general similarity of the pattern. The pattern sequence itself is an expression that is a statement in a language designed specifically to represent prescribed tar- gets in the most concise and flexible way to direct the automation of text processing of general text files, specific textual forms, or of random input strings. The alphabet, Σ, will consist of all printable ASCII characters except the metacharacters. The meatcharacters in Regex_Cat are: [](){}*+?^$|.
  • 3. What’s supported in Regex_Cat Currently the most basic regular expression operations can be performed: ()[]*+?| To support the POSIX Extended standard, the following features have yet to be added: Escapes: Any special character preceded by an escape shall match itself. Assertions: '^' shall match the start of a line when used as the first character of an expression, or the first character of a sub-expression. '$' shall match the end of a line when used as the last character of an expression, or the last character of a sub-expression. Repetition: a{n} Matches 'a' repeated exactly n times. a{n,} Matches 'a' repeated n or more times. a{n, m} Matches 'a' repeated between n and m times inclusive. Wildcard: The single character '.' when used outside of a character set will match any single character. 12 Character Classes: [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:graph:], [:print:], [:punct:], [:digit:], [:lower:], [:space:], [:upper:], [:xdigit:] Ex: [:alnum:] matches Alphanumeric characters, as would: [a-zA-Z0-9] The latter form is supported in the current Regex_Cat.
  • 4. What Regex_Cat does 1. Parse regex A. Insert explicit concatenation operators. Regex_Cat uses a ‘•’ which is not part of the first 128 printable ASCII characters. Care must be taken to compile Regex_Cat with UTF-8 encoding B. Convert expression from infix to postfix form. C. Save the Alphabet of the regex for later. 2. Build NFA Convert Postfix expression to NFA 3. Convert to DFA Use Subset construction to convert from NFA to DFA Build DFA Transition Table 4. Simulate Process the target string using the DFA Transition Table and Alphabet
  • 5. ArrayList A self expanding array. For storing data elements as well as iterating over said elements. DoublyLinkedList Similar to the ArrayList, but with the ability to iterate in both directions. Stack Great for bouncing stuff around by ‘pushing’ and ‘pop- ping’ elements from stack to stack. The Infix to Postfix algorithm is a good example. Directional List Graph Used in the construction of the NFA, the graph stores Edge data. Each Edge has a source vertex, a desti- nation vertex, and a ‘weight’ - in this case a character in the alphabet of the regex. The graph is implemented as an Adjacency List is used, as opposed to an ‘Adjacency Matrix’, because the NFA is ‘sparse’. A sparse graph is one in which the number of edges is much less than the possible num- ber of edges. The NFA vertices are connected by at most 2 edges, and in most cases, only 1 edge. Directional Matrix Graph The DFA on the other hand is a type of Adjacency Ma- trix, implemented as a 2-dimentional array. This makes the regex engine algorithm simple and fast, with run-times of O(n). The potential trade-off is the DFA’s memory footprint caused by ‘state explosion’ - at worst, the DFA can contain 2^n - 1 number of states of the NFA (where ‘n’ is the number of states). Data Structures
  • 6. Algorithms 1. Infx To Postfix Parses regular expression 2. Thompson-McNaughton-Yamada: Constructs NFA 3. Subset Construction: Converts NFA to DFA Constructs Transition Table 4. Simulation Engine Processes the target String Figure: A really cool visual NFA to DFA simulator: http://hackingoff.com/compilers/regular-expression-to-nfa-dfa a*(b|cd?)+ NFA DFA
  • 7. Infix to Postfix: InfixToPostfix.java The first step is to parse the expression into a form suitable to building the State Machine: Since the input has no explicit concatenation operators, we first insert those. We use the ‘•’ symbol, since it is not within the ASCII set we are supporting, hence it won’t conflict with any regular charcters. We just have to careful when compiling the code to set the JVM to use UTF-8 encoding. The next step is to convert the expression to ‘Polish notation’, or, Postfix form. The algorithm to convert an infix expression into a postfix expression uses a stack. The stack is used to hold operators. The purpose of the stack is to reverse the order of the operators in the expression. It also serves as a storage structure, since no operator can be printed until both of its operands have appeared. We also store the Alphabet of the regex in this step, so we can use it later to build the transition table, and process the target string. A summary of the rules: 1. Print operands as they arrive. 2. If the stack is empty or contains a left parenthesis on top, push the incoming operator onto the stack. 3. If the incoming symbol is a left parenthesis, push it on the stack. 4. If the incoming symbol is a right parenthesis, pop the stack and print the opera- tors until you see a left parenthesis. Discard the pair of parentheses. 5. If the incoming symbol has higher precedence than the top of the stack, push it on the stack. 6. If the incoming symbol has equal precedence with the top of the stack, use as- sociation. If the association is left to right, pop and print the top of the stack and then push the incoming operator. If the association is right to left, push the in- coming operator. 7. If the incoming symbol has lower precedence than the symbol on the top of the stack, pop the stack and print the top operator. Then test the incoming operator against the new top of stack. 8. At the end of the expression, pop and print all operators on the stack. (No parentheses should remain.) Taken from: http://csis.pace.edu/~wolf/CS122/infix-postfix.htm Precendence Collation-related bracket symbols [==] [::] [..] Escaped characters Character set (bracket expression) [] Grouping () Single-character-ERE duplication * + ? {m,n} Concatenation Anchoring ^$ Alternation | InfixToPostfix.java in our package: private static final String OPERATORS = "[](){}*+?•^$|."; private static final int[] PRECEDENCE = {6,6,5,5,4,4,4,4,4,3,2,2,1,1};
  • 8. This illustration uses a + symbol for concatenation. Regex_cat uses + to denote “one or more”, so our implementation uses a • as mentioned on the previous page. Taken from: http://csis.pace.edu/~wolf/CS122/infix-postfix.htm Infix to Postfix illustrated:
  • 9. Finite State Machine | Finite State Automaton Taken from: http://www.thefreedictionary.com/finite+state+machine A model of a computational system, consisting of a set of states (including a start state), an alphabet of symbols that serves as a set of possible inputs to the machine, and a transition function that maps each state to another state (or to itself) for any given input symbol. The machine operates by being fed a string of symbols, and moves through a series of states. NFA(Nondeterministic Finite Automaton) VS. DFA(Deterministic Finite Automaton) Taken from: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton A nondeterministic finite automaton (NFA), or nondeterministic finite state machine, is a finite state machine that (1) does not require input symbols for state transitions and (2) is capable of transitioning to zero or two or more states for a given start state and input symbol. This distinguishes it from a deterministic finite automaton (DFA), in which all transitions are uniquely determined and in which an input symbol is required for all state transitions. Although NFA and DFA have distinct definitions, all NFAs can be translated to equivalent DFAs using the subset construction algorithm,[1] i.e., constructed DFAs and their corresponding NFAs recognize the same formal language. Like DFAs, NFAs only recognize regular lan- guages. An NFA, similar to a DFA, consumes a string of input symbols. For each input symbol, it transitions to a new state until all input symbols have been consumed. Unlike a DFA, it is non-deterministic, i.e., for some state and input symbol, the next state may be one of two or more possible states. An NFA is represented formally by a 5-tuple, (Q, Σ, Δ, q0, F), consisting of a finite set of states Q a finite set of input symbols Σ ( the Alphabet of the regex ) a transition relation Δ : Q × Σ → P(Q). ( the Edges ) an initial (or start) state q0 ∈ Q a set of states F distinguished as accepting (or final) states F ⊆ Q. A DFA is also represented by a 5-tuple, (Q, Σ, δ, q0, F), with one difference. The DFA has a transition function (δ : Q × Σ → Q) in place of the the transition relation Δ in the NFA.
  • 10. 0 1 ε 3 ε 2 a 5 ε 6 ε 8 ε 4 ε 7 b ε 9 c ε 0 1 a 2 b 3 c NFA with Epsilon States Equivalent DFA Regular Expression: [a-c] Regular expressions and NFAs are equivalent in power: every regular expression has an equivalent NFA (they match the same strings) and vice versa. DFAs are also equivalent in power to NFAs and regular expressions.
  • 11. Thompson-McNaughton-Yamada NFA.java The NFA for a regular expression is built up from partial NFAs for each subexpression, with a different construction for each operator, as well as epsilon states. The Start State and Accepting State of each partial NFA is used to attach the next partial NFA in the building process. The postfix regular expression is used to build an NFA using a stack: while(the regular expression is not exhausted) When a symbol is encountered, an NFA corresponding to that symbol is constructed and pushed onto the stack. When an operator is encountered, a number of NFAs, consistent with the operator, is popped and used to construct a new NFA using the corresponding operator. A new Start state and a new Accept state is assigned. That newly constructed NFA is then pushed onto the stack.
  • 13. Subset Construction: NFA to DFA DFA.java The Subset constrauction algorithm consists of the following functions: getEclosure()calculates epslison closures of the NFA. Each closure becomes a State T of the DFA, and is added to the list of DFA States DS. move()assigns edges to DFA states by iterating over the e-closures. Subset Construction: while (there is an unmarked state T in DS){ mark T for (each letter ‘a’ in the alphabet ){ U = getEclosure(move(T,a)) if(U is not already in DS){ add U to DS as an unmarked state } push transition [T,a]=U onto the dfaTransitionStack of DFA transitons ( implemented as an ArrayList of int[] = {T, a, U} ) } }
  • 14. getEclosure(S): start with the Start State of the NFA push all states onto the States List S for each state in S push state onto stack add state to eClosure list while(stack is not empty){ pop stack if(the edge is an epsilon edge, and the destination edge doesn’t already exist in the eClosure){ push the destination state onto the stack add the destination state to the eClosure list } } return the eClosure list move(S, a) for each state s in List S for each Edge in s in NFA if the weight == the letter ‘a’ of the Alphabet, add desitnation state to list return list Subset Construction con’t:
  • 15. Construct the Transition Table from the list of DFA Transitions: TransTable = new int[number of DFA States][length of the Alphabet]; while(!dfaTransitionStack.empty()){ int[] trans = dfaTransitionStack.pop(); TransTable [trans[0]][trans[1]] = trans[2]; } Subset Construction con’t: Sample Output:
  • 16. Simulation Regex.java // exact match: private boolean eatDFA(int[][] table, String lang, ArrayList<Integer> finalStates, String target) throws IOException{ BufferedReader br = stringToBR(target); int c; int state = 0; while (((c = br.read()) != -1)) { int input = lang.indexOf((char)c); if(input != -1){ // only continue if input char is in the language if (table[state][input] == -1){ // unreachable state return false; }else if ( finalStates.contains(table[state][input]) && !br.ready()){ return true; // Table indicates that for this state, we accept the input given // and we reached the end of the string } // Advance to next state. state = table[state][input]; }else{ break; // it’s not even in the language, break and return false } } // we reached the end of the input, return false; }
  • 17. http://csis.pace.edu/~wolf/CS122/infix-postfix.htm http://eli.thegreenplace.net/files/docs/forays/col6.html http://www.fing.edu.uy/inco/cursos/intropln/material/p419-thompson.pdf http://swtch.com/~rsc/regexp/regexp1.html http://www.regular-expressions.info/posixbrackets.html http://xlinux.nist.gov/dads//HTML/sparsegraph.html http://binarysculpting.com/2012/02/11/regular-expressions-how-do-they-really-work-automata-theory-for-programmers-part-1/ http://binarysculpting.com/2012/02/15/converting-dfa-to-nfa-by-subset-construction-regular-expressions-part-2/ http://www.cs.hunter.cuny.edu/~saad/courses/csci493.66/hw/hw3.pdf http://hackingoff.com/compilers/regular-expression-to-nfa-dfa http://web.cecs.pdx.edu/~harry/compilers/slides/LexicalPart3.pdf http://ezekiel.vancouver.wsu.edu/~cs317/nfa/RegEx/src/regex/NFA.java https://www.clear.rice.edu/comp412/Lectures/L06LexIII-Subset-1up.pdf http://www.cs.uaf.edu/~cs631/notes/strings/node6.html http://www.youtube.com/watch?v=xlrw5Y8SXjA http://www.youtube.com/watch?v=59quXH2yUFo http://www.youtube.com/watch?v=iafR7TUZF8I http://www.youtube.com/watch?v=fYmi2okBG40 http://www.youtube.com/watch?v=3tdePzdkYXE Some helpful and some not so helpful resources