LEXING AND PARSING 
THE BEGINNER’S GUIDE
WHY ARE WE DOING THIS? 
• bbcode 
• html 
• xml 
• programming language
BUT I CAN JUST REGEX 
• sometimes you can 
• sometimes you can’t 
• is your html well formed? (view source some time) 
• it depends!!
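A rough sketch of the "sometimes you can, sometimes you can't" split, using nested bbcode-style tags. Finding each tag is a job a regex does well; checking that the tags nest properly needs a stack. The tag syntax and the name `is_balanced` are made up for illustration.

```python
import re

# Finding individual tags is a regular-language job, so a regex is fine here.
TAG = re.compile(r"\[(/?)([a-z]+)\]")

def is_balanced(text):
    """Return True if every [tag] is closed by a matching [/tag], in order."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)            # remember the open tag
        elif not stack or stack.pop() != name:
            return False                  # a close without its matching open
    return not stack                      # leftover opens mean unbalanced
```

The regex does the "word" recognition and the stack does the nesting check, which is roughly the lexer/parser split the rest of the deck is about.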
CHOMSKY HIERARCHY
COMPUTER SCIENCE 
WE LIKE ACRONYMS AND WEIRD WORDS
ENGLISH IS HARD! 
• tokenizer 
• scanner 
• lexer 
• parser 
• lexical analyzer 
• syntactic analyzer 
• formal grammar
LEXICAL ANALYSIS 
BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS 
LEXING
SCANNING 
• Finite State Machine 
• Finds Lexemes 
• Might backtrack
FINITE STATE MACHINE
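A minimal hand-rolled scanner, assuming three states ("start", "number", "name") and a toy input language. It is only meant to show lexemes falling out of state transitions; the state names and the `scan` function are hypothetical, not taken from any real lexer.

```python
def scan(text):
    """Split text into lexemes with a three-state machine."""
    lexemes = []
    state, start, i = "start", 0, 0
    text += "\n"                          # sentinel so the last lexeme is flushed
    while i < len(text):
        ch = text[i]
        if state == "start":
            if ch.isdigit():
                state, start = "number", i
            elif ch.isalpha() or ch in "_$":
                state, start = "name", i
            elif not ch.isspace():
                lexemes.append(ch)        # single-character lexeme
            i += 1
        elif (state == "number" and ch.isdigit()) or (
            state == "name" and (ch.isalnum() or ch == "_")
        ):
            i += 1                        # stay in the current state
        else:
            lexemes.append(text[start:i])
            state = "start"               # do not advance: re-examine ch
    return lexemes
```

Calling `scan("$y = 5;")` gives `['$y', '=', '5', ';']`. This machine never backtracks; one that "might backtrack" would need to remember the last accepting position.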
EVALUATOR 
• looks at lexeme to get value 
• lexeme + value = token
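One possible shape for the evaluator step, assuming a token is a (kind, lexeme, value) triple. The kinds and the `evaluate` helper are invented for this sketch.

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str
    lexeme: str
    value: object

def evaluate(lexeme):
    """Look at a lexeme and attach a value to it, producing a token."""
    if lexeme.isdecimal():
        return Token("NUMBER", lexeme, int(lexeme))
    if lexeme.startswith("$"):
        return Token("VARIABLE", lexeme, lexeme[1:])
    return Token("PUNCT", lexeme, None)   # some lexemes carry no value
```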
LEXING PHP - $y = 5;
• $y 
• array[309, ‘$y’, 1], 
• = 
• = 
• 5 
• array[305, 5, 1] 
• 309 == T_VARIABLE 
• 305 == T_LNUMBER
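For comparison only: Python exposes its own lexer through the standard `tokenize` module, much like PHP does with token_get_all. This sketch is not PHP; it just shows the same (type, lexeme, line) idea. The `lex` wrapper is a made-up helper.

```python
import io
import tokenize

def lex(source):
    """Return (token name, lexeme, line number) triples for Python source."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string, tok.start[0])
        for tok in tokenize.generate_tokens(readline)
    ]
```

For the input `"y = 5\n"` the first three triples are NAME `y`, OP `=`, and NUMBER `5`, all on line 1.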
LEXER GENERATORS 
DO NOT WRITE THIS BY HAND 
Famous 
• lex 
• flex 
• re2c 
• ANTLR 
• DFASTAR 
• jflex 
• jlex 
• quex 
PHP generators 
• https://github.com/oliverheins/PHPSimpleLexYacc 
• lex syntax 
• https://github.com/pear/PHP_LexerGenerator 
• re2c syntax 
• https://github.com/wez/JLexPHP 
• jlex syntax 
• token_get_all (see php-parser) 
• parse_ini_file/string (combined with parser)
RE2C
IN PHP LAND
SYNTACTIC ANALYSIS 
CONSTRUCTING SOMETHING BASED ON A GRAMMAR 
PARSING
THE PARSING PROCESS 
• Tokens come in 
• Magic 
• Data structure comes out 
• parse tree 
• AST
GRAMMAR (FORMAL OF COURSE) 
• "Brave men run in my family.” 
• I can't recommend this book too highly. 
• Prostitutes Appeal to Pope 
• I had had my car for four years before I ever learned to drive it.
TYPES OF PARSERS 
• Top Down 
• Recursive Descent 
• LL (left to right, leftmost derivation) 
• Earley parser 
• Bottom Up 
• Precedence parser 
• Operator-precedence parser 
• Simple precedence parser 
• BC (bounded context) parsing 
• LR parser (Left-to-right, Rightmost derivation) 
• Simple LR (SLR) parser 
• LALR parser 
• Canonical LR (LR(1)) parser 
• GLR parser 
• CYK parser 
• Recursive ascent parser
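Of the parser types listed, recursive descent is probably the easiest to write by hand. A small sketch, assuming the tokens are already lexed into Python ints and operator strings, and a made-up three-rule arithmetic grammar. Each grammar rule becomes one function.

```python
def parse_expression(tokens, pos=0):
    """expression := term (('+' | '-') term)*"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_term(tokens, pos):
    """term := factor (('*' | '/') factor)*"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_factor(tokens, pos):
    """factor := NUMBER | '(' expression ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if isinstance(tok, int):
        return tok, pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")
```

Each function returns the tree it built plus the position of the next unread token. `parse_expression([1, "+", 2, "*", 3])` returns `(("+", 1, ("*", 2, 3)), 5)`.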
SENTENCE DIAGRAMMING 
• People who live in glass houses shouldn't throw stones.
PARSE TREE
TOP DOWN VS. BOTTOM UP PARSING
PARSE TREES 
• Constituency-based parse trees 
• Dependency-based parse trees
AST 
• Not everything appears 
• additional information may be applied 
• can “improve” tree nodes 
• PHP is getting one!
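A sketch of the "not everything appears" point, assuming a parse tree is written as nested `(rule, child, ...)` tuples. The rule names and `to_ast` are hypothetical. The conversion drops grouping parentheses and pass-through rules.

```python
def to_ast(node):
    """Collapse a parse tree of (rule, children...) tuples into an AST."""
    if not isinstance(node, tuple):
        return node                       # a leaf token
    rule, *children = node
    # Grouping parentheses are implied by the tree shape, so drop them.
    children = [to_ast(child) for child in children if child not in ("(", ")")]
    if len(children) == 1:
        return children[0]                # a pass-through rule adds nothing
    if len(children) == 3:
        left, op, right = children        # left operand, operator, right operand
        return (op, left, right)
    return (rule, *children)

# Parse tree for "(1 + 2) * 3" under an expr/term/factor grammar.
PARSE_TREE = (
    "expr",
    ("term",
     ("factor", "(",
      ("expr", ("term", ("factor", 1)), "+", ("term", ("factor", 2))),
      ")"),
     "*",
     ("factor", 3)),
)
```

`to_ast(PARSE_TREE)` gives `("*", ("+", 1, 2), 3)`: same meaning, far fewer nodes.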
LALR(K) 
• Look ahead prevents “ambiguous” parsing 
• I have one token, what token comes next?
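The "what token comes next?" question can be shown without any generated tables. This is NOT an LALR parser, only a hand-written illustration of one token of lookahead; `classify_statement` and the statement kinds are invented.

```python
def classify_statement(tokens):
    """Decide which rule applies by peeking one token past the first."""
    if not tokens:
        raise SyntaxError("empty statement")
    lookahead = tokens[1] if len(tokens) > 1 else None
    if lookahead == "=":
        return "assignment"               # name = ...
    if lookahead == "(":
        return "call"                     # name ( ...
    return "expression"
```

Both `x = 1` and `x ( )` start with the same token, so the first token alone is ambiguous. One token of lookahead settles it.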
PARSER GENERATORS 
Famous 
• bison 
• bison 
• bison 
• bison 
• yacc 
• lemon 
• ANTLR 
PHP versions 
• https://github.com/wez/lemon-php 
• https://github.com/pear/PHP_ParserGenerator 
• lemon 
• https://github.com/scato/phpeg 
• peg (peg.js) 
• https://github.com/jakubkulhan/pacc 
• yacc
BISON 
• Generates LALR (or GLR) parsers 
• Code in C, C++ or Java 
• reentrant with %define api.pure set 
• used by ALL THE THINGS 
• PHP 
• Ruby 
• Postgresql 
• Go
BISON IN C
LEMON 
• Generates LALR(1) parser 
• reentrant AND thread safe 
• non-terminal destructor (leak avoidance) 
• pull parsing 
• sqlite
PHP LEMON
REENTRANT VS THREAD SAFE 
• Process 
• Thread 
• Locking 
• Scope 
• Reentrant
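A small sketch of the scope part of this slide, under the assumption that "reentrant" here means "keeps no shared state between calls". The first function leans on a module-level variable; the class keeps its position per instance. All names are hypothetical, and nothing here makes either version thread safe: that would need locking on top.

```python
_position = 0   # shared module-level state used by the non-reentrant version

def next_char_global(text):
    """Not reentrant: every caller shares the same _position."""
    global _position
    ch = text[_position] if _position < len(text) else None
    _position += 1
    return ch

class Cursor:
    """Reentrant: each scan carries its own position."""

    def __init__(self, text):
        self.text = text
        self.position = 0

    def next_char(self):
        ch = self.text[self.position] if self.position < len(self.text) else None
        self.position += 1
        return ch
```

Two `Cursor` objects can be interleaved freely. Interleaving two inputs through `next_char_global` skips characters because both share one counter.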
COMPILE IT 
• transform programming language to computer language
INTERPRET IT 
• directly executes programming language
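The two slides above, side by side on the same tree. This sketch assumes an AST of nested `(op, left, right)` tuples and a made-up two-instruction stack machine as the "computer language". None of it reflects how PHP or any real compiler does it.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(node):
    """Interpret: walk the tree and compute the result directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))

def compile_to_stack_code(node):
    """Compile: translate the tree into instructions for a toy stack machine."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    return compile_to_stack_code(left) + compile_to_stack_code(right) + [("APPLY", op)]

def run(code):
    """The toy machine that executes the compiled instructions."""
    stack = []
    for instruction, arg in code:
        if instruction == "PUSH":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()
```

For `("+", 1, ("*", 2, 3))`, `interpret` returns 7 straight away, while `compile_to_stack_code` returns a list of instructions that `run` later turns into 7.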
PROFIT
UNDER THE HOOD 
WHAT USES THIS STUFF?
PHP 
RE2C + Bison + these crazy opcodes….
LALR(1) WRITTEN BY HAND 
How - pythonic
HHVM 
Flex and Bison and JIT – OH MY!
SQLITE 
Lemon is tasty!
WRITING PARSERS AND LEXERS 
THEORIES OF CODING
STEP 1: THINK SMALL 
• Writing a general purpose parser is hard – that’s why you use PHP 
• Writing a single purpose parser is much easier 
• markup text (markdown) 
• configuration or definition files (behat/gherkin syntax) 
• complex validation (addresses in multiple formats)
STEP 2: SEPARATE AND UNOPTIMIZED 
• premature optimization yada yada 
• combine after it’s ready to be used (or not at all if you’ll need to change it later) 
• lexer and parser each have unique, well defined goals 
• the ability to potentially switch parser styles later will help you!
STEP 3: LEXER 
• the lexer's job is to recognize tokens 
• it can do this via a giant switch statement of doom 
• or maybe a giant loop 
• or maybe a list of goto statements 
• or maybe a complex class with methods 
• …. or you can just use a generator
LET’S BREAK THAT DOWN 
1. Define a token format 
2. Define grammar format (what are we looking for?) 
3. Go over the input data (usually a string) and make matches 
1. compare or regex or ctype_* or however it makes sense 
4. Keep track of your current state 
5. Have an output format – AST, tree, whatever
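The five steps above can be sketched as a small regex-driven lexer. The token kinds, the patterns, and the `lex_source` name are assumptions for a toy PHP-like input, not a real PHP lexer.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):                  # step 1: the token format
    kind: str
    text: str
    line: int

# step 2: what we are looking for, in priority order
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),
    ("VARIABLE", r"\$[A-Za-z_]\w*"),
    ("OP",       r"[=+\-*/;]"),
    ("NEWLINE",  r"\n"),
    ("SKIP",     r"[ \t]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex_source(source):
    """Yield Token objects for a toy PHP-like input."""
    line = 1                                   # step 4: the current state
    for match in MASTER.finditer(source):      # step 3: go over the input
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind == "MISMATCH":
            raise SyntaxError(f"unexpected {text!r} on line {line}")
        elif kind != "SKIP":
            yield Token(kind, text, line)      # step 5: the output format
```

The only state kept here is the line number. Anything fancier, such as lookahead or lookbehind, is where hand-written lexers start to hurt.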
STEP 4: PARSER 
• Loop over our tokens 
• Look at the values and decide what to do
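A bare-bones version of that loop, assuming tokens arrive as `(kind, text)` pairs and the only statement allowed is `NAME = NUMBER ;`. The token kinds and `parse_assignments` are invented for a small config-file style parser.

```python
def parse_assignments(tokens):
    """Parse a flat token list of NAME '=' NUMBER ';' statements into a dict."""
    result = {}
    position = 0
    while position < len(tokens):
        statement = tokens[position:position + 4]
        kinds = [kind for kind, _ in statement]
        if kinds != ["NAME", "EQUALS", "NUMBER", "SEMICOLON"]:
            raise SyntaxError(f"bad statement at token {position}: {statement}")
        name, number = statement[0][1], statement[2][1]
        result[name] = int(number)        # decide what to do: store the setting
        position += 4
    return result
```

A grammar this small needs no tree at all: the output structure is just a dict.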
STEP 5: DO SOMETHING WITH IT! 
1. Compile – write out to something that can be run (html) 
2. Interpret – run through another program to get output (templates to html) 
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools) 
4. Validate – check for proper “spelling and grammar” 
5. ??? 
6. PROFIT
“If you’re not sure how to do a job – ask!” 
- silly poster on my laundry room wall
RESOURCES 
• http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html 
• http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html 
• https://github.com/hafriedlander/php-peg 
• https://github.com/nikic/PHP-Parser/ 
• http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html 
• http://wikipedia.org
CONTACT ME 
• auroraeosrose@gmail.com 
• auroraeosrose – freenode.net #phpmentoring #phpwomen 
• Twitter - @auroraeosrose 
• http://github.com/auroraeosrose


Editor's Notes

  • #2: Why I got started with this: I’ve never taken a computer class. I wanted to understand why PHP works the way it does, because I’d been pondering putting some eventing/async magic inside, and I ended up down this deep computer science pit where compilers are at the bottom.
  • #3: Lexers are used to recognize the "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers. Parsers are used to recognize the "structure" of language phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
  • #4: Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and those will not work in every case. It should be possible to construct an HTML file that any given regular expression matches wrongly.
  • #5: A formal grammar defines (or generates) a formal language, which is a (usually infinite) set of finite-length sequences of symbols (i.e. strings) that may be constructed by applying production rules to another sequence of symbols which initially contains just the start symbol. Type-0 grammars (unrestricted grammars) include all formal grammars. Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages. Type-2 grammars (context-free grammars) generate the context-free languages. Type-3 grammars (regular grammars) generate the regular languages.
  • #6: So computer science is a really weird discipline. Quite a bit of what computer science is and does comes from, well, math. The other part, the “language” aspects and even the concepts of grammar and meaning, comes from “English”, or “language arts” as my kids’ school calls it. The only “science” part that I think really applies is that we test theories and apply logic. At its core, remember, computers are algorithms (rules) and information (data), but “computer science” has grown to encompass LOTS of things. What we’re going to talk about is a small but fundamental window, lexing and parsing, so let’s start with words.
  • #7: Ask for people seeing these terms. Ask if anyone knows a definition of these terms, even a non-computer science definition. Almost all of these terms have different meanings depending on their context; the computer science definitions are what we’re going to be using. We’re also going to mention that some terms get thrown around a bit (parser and scanner are the two worst), but I’m also going to attempt to help you build your own internal rules, so you don’t confuse yourself and others, by always using them in the “computer science dictionary” manner.
  • #8: Scanner == first stage of lexer. Strictly speaking, a lexer is itself a kind of parser, but we won’t EVER call it a parser cause CONFUSION. The syntax of some programming languages is divided into two pieces: the lexical syntax (token structure), which is processed by the lexer, and the phrase syntax, which is processed by the parser. The lexical syntax is usually a regular language, whose alphabet consists of the individual characters of the source code text. The phrase syntax is usually a context-free language, whose alphabet consists of the tokens produced by the lexer. While this is a common separation, a lexer can alternatively be combined with the parser in scannerless parsing. I would say though: DO NOT DO THIS. It may seem easier in the short term, but when you have to start changing stuff you will have PAIN.
  • #9: Finite state machine: we have a finite (bounded) list of states, and the machine can be in only one state at any one time. Because a finite state machine can represent any history and a reaction, by regarding the change of state as a response to the history, it has been argued that it is a sufficient model of human behaviour, i.e. humans are finite state machines. Lexeme == the characters that have been matched by our state machine; it still needs to be translated to a value.
  • #10: Set up an example of a state machine for people. States: happy, sad, angry. Inputs: money, food, kick in pants. Outputs: smile, frown, punch back.
  • #11: Sometimes there isn’t a value (parentheses in a programming language, for example). Sometimes a lexeme is suppressed (comments, anyone?). Sometimes a lexeme or token is even ADDED by the lexer: line continuation (C code), semi-colon insertion (lazy bad javascript! and go? really!), the off-side rule – blocks with indents (oh python) or braces (php and C and friends). Context sensitivity: good lexers are NOT context-sensitive. the more look ahead, look back, and backtracking
  • #12: So discuss a little bit about PHP: its lexer is exposed with token_get_all. It’ll “parse”/“tokenize” (lex is the correct term) the PHP fed to it. This is why there are many parsers written in PHP but not really any lexers: it’s in there. This is GENERALLY the easy part! What is the 1? Line numbers.
  • #13: ANTLR - Can generate lexical analyzers and parsers. DFASTAR - Generates DFA matrix table-driven lexers in C++. Flex - Alternative variant of the classic "lex" (C/C++). JFlex - A rewrite of JLex. Ragel - A state machine and lexer generator with output in C, C++, C#, Objective-C, D, Java, Go and Ruby. The following lexical analysers can handle Unicode: JavaCC - JavaCC generates lexical analyzers written in Java. JLex - A lexical analyzer generator for Java. Quex - A fast universal lexical analyzer generator for C and C++. SO if you’re generating
  • #14: rules, named definitions and in-place configurations.
  • #16: Ah, the overloading of the word parsing. Syntactic analysis and grammar: it looks at the data sent and builds a model, usually some kind of data structure or tree, for what that data looks like. Just like in English, we take grammar to define ideas.
  • #17: A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process. You can do scannerless (again with the silly overloading of words), a “non lexed” parser, but – sigh.
  • #18: A formal grammar is a set of rules for rewriting strings, along with a "start symbol" from which rewriting starts. Parsing is the process of recognizing an utterance (a string in natural languages) by breaking it down to a set of symbols and analyzing each one against the grammar of the language. This is why what comes before and after can be important when parsing. Your brain is a very good parser.
  • #19: In top-down parsing, one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. Top down parsers can be small and powerful and readable, although they can be slower. A top down parser with a direct path is going to beat a more complex path. A bottom up parser can be faster, but you need to match the type of parser with what you’re doing.
  • #21: So let’s take a theoretical piece of code that’s been lexed into these values and turn it into a “parse tree” – we’ll get into that in a moment.
  • #22: The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parser discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edges of each subtree, and top-down on the rest of the parse tree. If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers. Bottom-up parsing is sometimes done by backtracking. But much more commonly, bottom-up parsing is done by a shift-reduce parser such as a LALR parser.
  • #23: A parse tree is an ordered, rooted tree that represents the syntactic structure of a string; its structure and elements concretely reflect the syntax of the input language. Dependency-based trees are simpler on average than constituency-based parse trees because they contain many fewer nodes. So dependency would say noun, verb, adverb; constituency would be sentence, noun phrase, verb phrase, and breaks it down into smaller pieces.
  • #24: Abstract syntax tree: the syntax is "abstract" in not representing every detail appearing in the real syntax. Grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
  • #25: LALR: look-ahead, left-to-right, rightmost derivation. The look ahead can be different depending on the parser type, but bison and friends are all LALR(1) generators.
  • #26: bison is re-entrant but NOT thread safe
  • #27: Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser (either in C, C++, or Java) which reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar. Note that bison is re-entrant; it’s not by default thread safe (these are two different things).
  • #29: Lemon requires you to write more rules than Bison because of its simplified syntax: no repetitions and optionals, one action per rule, etc. Complete set of LALR(1) parser limitations. Only the C language.
  • #31: Code is reentrant if it can be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution. A reentrant subroutine can achieve thread-safety, but being reentrant alone might not be sufficient to be thread-safe in all situations. Conversely, thread-safe code does not necessarily have to be reentrant. A piece of code is thread-safe if it only manipulates shared data structures in a manner that guarantees safe execution by multiple threads at the same time.
  • #32: Compilers generally write out to assembly or machine code, but technically anything can be compiled down to something to be run (plug reckit).
  • #33: An interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program.
  • #37: PHP bison file PHP bison C output
  • #39: hand written lexer and lemon parser
  • #40: A parser is a program which processes an input and "understands" it; a lexer is a program which splits something into tokens and assigns each one a value. There are steps you can take to make doing this easier and make you feel less “OMG I’m WRITING A PARSER”, or you can cheat and just use a generator.
  • #41: So when you first get started think of something small
  • #43: Each of these types of lexers is going to have its advantages and disadvantages. The trick here is to not let the lexer do more than it’s supposed to. It should be context free or you’ll hate yourself later. If you absolutely positively have to lookahead or lookbehind, you’ll hate yourself later. Put as much information into your token definition as you want.