Regular Expressions in Java

Regular Expressions

- A regular expression is a kind of pattern that can be applied to text (Strings, in Java).
- A regular expression either matches the text (or part of the text), or it fails to match.
  - If a regular expression matches a part of the text, you can easily find out which part.
  - If a regular expression is complex, you can find out which parts of the regular expression match which parts of the text.
  - With this information, you can readily extract parts of the text, or do substitutions in the text.
- Regular expressions are an extremely useful tool for manipulating text.
  - Regular expressions are heavily used in the automatic generation of Web pages.

Perl and Java

- The Perl programming language is heavily used in server-side programming, because:
  - Much server-side programming is text manipulation.
  - Regular expressions are built into the syntax of Perl.
- Beginning with Java 1.4, Java has a regular expression package, java.util.regex.
  - Java’s regular expressions are almost identical to those of Perl.
  - This capability greatly enhanced Java 1.4’s text handling.
- Regular expressions in Java are just a normal package, with no new syntax to support them.
  - Java’s regular expressions are just as powerful as Perl’s, but
  - regular expressions are easier and more convenient in Perl.

A first example

- The regular expression "[a-z]+" will match a sequence of one or more lowercase letters.
  - [a-z] means any character from a through z, inclusive.
  - + means “one or more”.
- Suppose we apply this pattern to the String "Now is the time". There are three ways we can apply this pattern:
  - To the entire string: it fails to match because the string contains characters other than lowercase letters.
  - To the beginning of the string: it fails to match because the string does not begin with a lowercase letter.
  - To search the string: it will succeed and match ow.
    - If applied repeatedly, it will find is, then the, then time, then fail.

Doing it in Perl and Ruby

- In both Perl and Ruby, a regular expression is written between forward slashes, for example, /[a-z]+/
- Regular expressions are values, and can be used as such.
  - For example, line.split(/\s+/)
- We can search for matches to a regular expression with the =~ operator.
  - For example, name = "Dave"; name =~ /[a-z]+/; will find ave (the pattern /[a-z]/ without the + would find only the single letter a).

Doing it in Java, I

- First, you must compile the pattern:

    import java.util.regex.*;
    Pattern p = Pattern.compile("[a-z]+");

- Next, you must create a matcher for a specific piece of text by sending a message to your pattern:

    Matcher m = p.matcher("Now is the time");

- Points to notice:
  - Pattern and Matcher are both in java.util.regex.
  - Neither Pattern nor Matcher has a public constructor; you create these by using methods in the Pattern class.
  - The matcher contains information about both the pattern to use and the text to which it will be applied.

Doing it in Java, II

- Now that we have a matcher m:
  - m.matches() returns true if the pattern matches the entire text string, and false otherwise.
  - m.lookingAt() returns true if the pattern matches at the beginning of the text string, and false otherwise.
  - m.find() returns true if the pattern matches any part of the text string, and false otherwise.
    - If called again, m.find() will start searching from where the last match ended.
    - m.find() will return true for as many matches as there are in the string; after that, it will return false.
    - After m.find() has returned false, call m.reset() before using the matcher to search the same text again from the beginning.
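
The three ways of applying a pattern can be compared side by side. The following is a minimal sketch using the pattern and text from the first example; the class name ThreeWaysDemo and its helper methods are illustrative names, not part of java.util.regex.

```java
import java.util.regex.Pattern;

public class ThreeWaysDemo {
    // One or more lowercase letters, as in the first example.
    static final Pattern LOWER = Pattern.compile("[a-z]+");

    // True only if the pattern matches the entire text.
    static boolean wholeText(String text) {
        return LOWER.matcher(text).matches();
    }

    // True if the pattern matches starting at index 0 of the text.
    static boolean atStart(String text) {
        return LOWER.matcher(text).lookingAt();
    }

    // True if the pattern matches anywhere in the text.
    static boolean anywhere(String text) {
        return LOWER.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Now is the time";
        System.out.println(wholeText(text)); // false: contains 'N' and spaces
        System.out.println(atStart(text));   // false: starts with uppercase 'N'
        System.out.println(anywhere(text));  // true: finds "ow"
    }
}
```

Each helper builds a fresh Matcher, so the three calls do not affect one another.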

Finding what was matched

- After a successful match, m.start() will return the index of the first character matched.
- After a successful match, m.end() will return the index of the last character matched, plus one.
- If no match was attempted, or if the match was unsuccessful, m.start() and m.end() will throw an IllegalStateException.
  - This is a RuntimeException, so you don’t have to catch it.
- It may seem strange that m.end() returns the index of the last character matched plus one, but this is just what most String methods require.
  - For example, "Now is the time".substring(m.start(), m.end()) will return exactly the matched substring.
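
As a small sketch of these rules, the class below reports the start and end indices of the first match and checks that calling m.start() before any match attempt throws. StartEndDemo and its method names are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartEndDemo {
    // Returns {start, end} of the first match, or null if there is none.
    static int[] firstMatchSpan(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    // Calls start() on a matcher that has not attempted a match yet.
    static boolean startThrowsBeforeMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "Now is the time";
        int[] span = firstMatchSpan("[a-z]+", text);
        System.out.println(span[0] + ".." + span[1]);           // 1..3
        System.out.println(text.substring(span[0], span[1]));   // ow
        System.out.println(startThrowsBeforeMatch("[a-z]+", text)); // true
    }
}
```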

A complete example

    import java.util.regex.*;

    public class RegexTest {
        public static void main(String[] args) {
            String pattern = "[a-z]+";
            String text = "Now is the time";
            Pattern p = Pattern.compile(pattern);
            Matcher m = p.matcher(text);
            // Print every run of lowercase letters, each followed by a star
            while (m.find()) {
                System.out.print(text.substring(m.start(), m.end()) + "*");
            }
        }
    }

Output: ow*is*the*time*

Additional methods

- If m is a matcher, then:
  - m.replaceFirst(replacement) returns a new String where the first substring matched by the pattern has been replaced by replacement.
  - m.replaceAll(replacement) returns a new String where every substring matched by the pattern has been replaced by replacement.
  - m.find(startIndex) resets the matcher and looks for the next pattern match, starting at the specified index.
  - m.reset() resets this matcher.
  - m.reset(newText) resets this matcher and gives it new text to examine (any CharSequence, such as a String, StringBuffer, or CharBuffer).
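
A brief sketch of these Matcher methods applied to the running example text follows. The class MatcherMethodsDemo and its helpers are hypothetical wrappers written for this illustration; the Matcher calls inside them are the real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo {
    static final Pattern LOWER = Pattern.compile("[a-z]+");

    // Replace only the first run of lowercase letters.
    static String replaceFirstLower(String text, String replacement) {
        return LOWER.matcher(text).replaceFirst(replacement);
    }

    // Replace every run of lowercase letters.
    static String replaceAllLower(String text, String replacement) {
        return LOWER.matcher(text).replaceAll(replacement);
    }

    // First match found at or after startIndex, or null if none.
    static String findFrom(String text, int startIndex) {
        Matcher m = LOWER.matcher(text);
        return m.find(startIndex) ? m.group() : null;
    }

    // Reuse one matcher on new text; returns the first match in newText.
    static String firstAfterReset(String oldText, String newText) {
        Matcher m = LOWER.matcher(oldText);
        m.find();           // consume a match in the old text
        m.reset(newText);   // same pattern, different input
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Now is the time";
        System.out.println(replaceFirstLower(text, "X")); // NX is the time
        System.out.println(replaceAllLower(text, "X"));   // NX X X X
        System.out.println(findFrom(text, 6));            // the
        System.out.println(firstAfterReset(text, "abc DEF")); // abc
    }
}
```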

Some simple patterns

    abc          exactly this sequence of three letters
    [abc]        any one of the letters a, b, or c
    [^abc]       any character except one of the letters a, b, or c
                 (immediately within an open bracket, ^ means “not,”
                 but anywhere else it just means the character ^)
    [a-z]        any one character from a through z, inclusive
    [a-zA-Z0-9]  any one letter or digit

Sequences and alternatives

- If one pattern is followed by another, the two patterns must match consecutively.
  - For example, [A-Za-z]+[0-9] will match one or more letters immediately followed by one digit.
- The vertical bar, |, is used to separate alternatives.
  - For example, the pattern abc|xyz will match either abc or xyz.
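
The two example patterns can be tried out with a short sketch like the one below. SequenceAlternativeDemo and its helpers are illustrative names; Pattern.matches(regex, input) is the standard shortcut for compiling a pattern and matching it against the entire input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceAlternativeDemo {
    // True if the regex matches the entire text.
    static boolean full(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // First substring matched anywhere in the text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sequence: letters must be immediately followed by one digit.
        System.out.println(full("[A-Za-z]+[0-9]", "abc1")); // true
        System.out.println(full("[A-Za-z]+[0-9]", "abc"));  // false: no digit
        System.out.println(firstMatch("[A-Za-z]+[0-9]", "room42b")); // room4

        // Alternatives: either side of the bar may match.
        System.out.println(full("abc|xyz", "xyz")); // true
        System.out.println(full("abc|xyz", "abz")); // false
    }
}
```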

Some predefined character classes

    .    any one character except a line terminator
    \d   a digit: [0-9]
    \D   a non-digit: [^0-9]
    \s   a whitespace character: [ \t\n\x0B\f\r]
    \S   a non-whitespace character: [^\s]
    \w   a word character: [a-zA-Z_0-9]
    \W   a non-word character: [^\w]

Notice the space at the start of the \s class. Spaces are significant in regular expressions!
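
A quick sketch of a few of these classes in use is shown below. Note that inside a Java String literal each backslash must itself be doubled, so \d is written "\\d". The class name PredefinedClassDemo and the sample texts are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PredefinedClassDemo {
    // First substring matched anywhere in the text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\d+", "Room 101"));      // 101
        System.out.println(firstMatch("\\w+", "hello_world!"));  // hello_world
        System.out.println(firstMatch("\\S+", "  padded  "));    // padded
        // The dot stops at a line terminator by default.
        System.out.println(firstMatch(".+", "line one\nline two")); // line one
    }
}
```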

Boundary matchers

These patterns match the empty string if at the specified position:

    ^    the beginning of a line
    $    the end of a line
    \b   a word boundary
    \B   not a word boundary
    \A   the beginning of the input (can be multiple lines)
    \Z   the end of the input except for the final terminator, if any
    \z   the end of the input
    \G   the end of the previous match
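
To see the effect of \b and \B, the sketch below counts occurrences of is in a sample sentence with and without boundary matchers. The class BoundaryDemo, its helper, and the sample sentence are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Counts how many times the regex matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "This is his list";
        // No boundaries: matches inside This, is, his, and list.
        System.out.println(countMatches("is", text));        // 4
        // Word boundary on both sides: only the standalone word.
        System.out.println(countMatches("\\bis\\b", text));  // 1
        // Non-boundary on both sides: only the one inside "list".
        System.out.println(countMatches("\\Bis\\B", text));  // 1
    }
}
```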

Greedy quantifiers

(The term “greedy” will be explained later.)

Assume X represents some pattern:

    X?      optional, X occurs once or not at all
    X*      X occurs zero or more times
    X+      X occurs one or more times
    X{n}    X occurs exactly n times
    X{n,}   X occurs n or more times
    X{n,m}  X occurs at least n but not more than m times

Note that these are all postfix operators, that is, they come after the operand.
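
The quantifiers can be checked against whole strings with a sketch such as the following. The patterns and inputs are examples chosen for this illustration, and QuantifierDemo is a hypothetical class name.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the regex matches the entire text.
    static boolean full(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // X? : the u is optional
        System.out.println(full("colou?r", "color"));   // true
        System.out.println(full("colou?r", "colour"));  // true
        // X* allows zero occurrences, X+ requires at least one
        System.out.println(full("ab*", "a"));           // true
        System.out.println(full("ab+", "a"));           // false
        // X{n} and X{n,m}
        System.out.println(full("[0-9]{3}", "123"));    // true
        System.out.println(full("a{2,3}", "a"));        // false
        System.out.println(full("a{2,3}", "aaa"));      // true
        System.out.println(full("a{2,3}", "aaaa"));     // false
    }
}
```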

Types of quantifiers

- A greedy quantifier will match as much as it can, and back off if it needs to.
  - We’ll do examples in a moment.
- A reluctant quantifier will match as little as possible, then take more if it needs to.
  - You make a quantifier reluctant by appending a ?:
    X??  X*?  X+?  X{n}?  X{n,}?  X{n,m}?
- A possessive quantifier will match as much as it can, and never let go.
  - You make a quantifier possessive by appending a +:
    X?+  X*+  X++  X{n}+  X{n,}+  X{n,m}+

Quantifier examples

- Suppose your text is aardvark
- Using the pattern a*ardvark (a* is greedy):
  - The a* will first match aa, but then ardvark won’t match.
  - The a* then “backs off” and matches only a single a, allowing the rest of the pattern (ardvark) to succeed.
- Using the pattern a*?ardvark (a*? is reluctant):
  - The a*? will first match zero characters (the null string), but then ardvark won’t match.
  - The a*? then extends and matches the first a, allowing the rest of the pattern (ardvark) to succeed.
- Using the pattern a*+ardvark (a*+ is possessive):
  - The a*+ will match the aa, and will not back off, so ardvark never matches and the pattern match fails.
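
The aardvark example can be run directly, as in the sketch below. Because the greedy and reluctant patterns both succeed on aardvark, the sketch also adds a second, invented example (a tag-like pattern applied to "<b>bold</b>") where the two kinds of quantifier match different amounts of text. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierTypesDemo {
    // True if the regex matches the entire text.
    static boolean full(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // First substring matched anywhere in the text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(full("a*ardvark", "aardvark"));   // true  (greedy backs off)
        System.out.println(full("a*?ardvark", "aardvark"));  // true  (reluctant extends)
        System.out.println(full("a*+ardvark", "aardvark"));  // false (possessive keeps aa)

        // Greedy takes as much as possible, reluctant as little as possible.
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));   // <b>bold</b>
        System.out.println(firstMatch("<.+?>", html));  // <b>
    }
}
```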

Capturing groups

- In regular expressions, parentheses are used for grouping, but they also capture (keep for later use) anything matched by that part of the pattern.
  - Example: ([a-zA-Z]*)([0-9]*) matches any number of letters followed by any number of digits.
  - If the match succeeds, \1 holds the matched letters and \2 holds the matched digits.
  - In addition, \0 holds everything matched by the entire pattern.
- Capturing groups are numbered by counting their opening parentheses from left to right:

    ( ( A ) ( B ( C ) ) )
    1 2     3   4

    \0 = \1 = ((A)(B(C))),  \2 = (A),  \3 = (B(C)),  \4 = (C)

- Example: ([a-zA-Z])\1 will match a double letter, such as the tt in letter.
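
The numbering rule and the double-letter backreference can both be checked with a short sketch. GroupNumberingDemo and its helpers are hypothetical names; the input "ABC" is chosen so that each group in the diagram captures something visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingDemo {
    // Returns groups 0..groupCount() after matching the whole text.
    static String[] groupsOf(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    // Finds the first doubled letter using a backreference to group 1.
    static String firstDoubleLetter(String text) {
        Matcher m = Pattern.compile("([a-zA-Z])\\1").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] g = groupsOf("((A)(B(C)))", "ABC");
        // Group 0 is the whole match; groups 1 to 4 follow the opening parentheses.
        System.out.println(String.join(", ", g)); // ABC, ABC, A, BC, C
        System.out.println(firstDoubleLetter("letter")); // tt
    }
}
```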

Capturing groups in Java

- If m is a matcher that has just performed a successful match, then:
  - m.group(n) returns the String matched by capturing group n.
    - This could be an empty string.
    - This will be null if the pattern as a whole matched but this particular group didn’t match anything.
  - m.group() returns the String matched by the entire pattern (same as m.group(0)).
    - This could be an empty string.
- If m didn’t match (or wasn’t tried), then these methods will throw an IllegalStateException.
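
The difference between an empty group and a null group is easy to miss, so here is a minimal sketch of both cases. The alternation pattern (a)|(b) is an example added for this illustration, and GroupNullDemo is a hypothetical class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNullDemo {
    // Returns group n after matching the whole text against the regex.
    static String groupOf(String regex, String text, int n) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        return m.group(n);
    }

    public static void main(String[] args) {
        // Group 2 takes part in the match but matches zero digits: empty string.
        System.out.println(groupOf("([a-zA-Z]*)([0-9]*)", "abc", 1));           // abc
        System.out.println(groupOf("([a-zA-Z]*)([0-9]*)", "abc", 2).isEmpty()); // true

        // Group 1 is on the alternative that was not taken: null.
        System.out.println(groupOf("(a)|(b)", "b", 1)); // null
        System.out.println(groupOf("(a)|(b)", "b", 2)); // b
    }
}
```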

Example use of capturing groups

- Suppose word holds a word in English.
- Also suppose we want to move all the consonants at the beginning of word (if any) to the end of the word (so string becomes ingstr):

    // Group 1: leading non-vowels; group 2: everything else
    Pattern p = Pattern.compile("([^aeiou]*)(.*)");
    Matcher m = p.matcher(word);
    if (m.matches()) {
        System.out.println(m.group(2) + m.group(1));
    }

- Note the use of (.*) to indicate “all the rest of the characters”.

Double backslashes

- Backslashes have a special meaning in regular expressions; for example, \b means a word boundary.
- Backslashes have a special meaning in Java; for example, \b means the backspace character.
- Java syntax rules apply first!
  - If you write "\b[a-z]+\b" you get a string with backspace characters in it--this is not what you want!
  - Remember, you can quote a backslash with another backslash, so "\\b[a-z]+\\b" gives the correct string.
- Note: if you read in a String from somewhere, this does not apply--you get whatever characters are actually there.
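
The effect of the missing backslashes shows up in the length of the resulting strings and in what they match. The sketch below compares the two literals from this slide; BackslashDemo and its helper are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashDemo {
    // Regex \b[a-z]+\b : each backslash is doubled in the Java literal.
    static final String CORRECT = "\\b[a-z]+\\b";
    // Java turns \b into a backspace character before the regex engine sees it.
    static final String WRONG = "\b[a-z]+\b";

    // First substring matched anywhere in the text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(CORRECT.length()); // 10 characters: \ b [ a - z ] + \ b
        System.out.println(WRONG.length());   // 8 characters: backspace [ a - z ] + backspace
        String text = "Now is the time";
        System.out.println(firstMatch(CORRECT, text)); // is
        System.out.println(firstMatch(WRONG, text));   // null: no backspaces in the text
    }
}
```

The first whole-word match is is rather than ow, because ow is preceded by the letter N and so does not start at a word boundary.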

Escaping metacharacters

- A lot of special characters--parentheses, brackets, braces, stars, plus signs, etc.--are used in defining regular expressions; these are called metacharacters.
- Suppose you want to search for the character sequence a* (an a followed by a star):
  - "a*" doesn’t work; that means “zero or more a’s”.
  - "a\*" doesn’t work; \* is not a legal escape sequence in a Java String constant, so the code will not compile.
  - "a\\*" does work; it’s the three-character string a, \, *.
- Just to make things even more difficult, in a Java regular expression it’s illegal to put a backslash before an alphabetic character that has no defined escape meaning; a backslash before a non-alphabetic character is always allowed.
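
The sketch below tries the working and non-working patterns for finding a literal a*. It also uses Pattern.quote, a standard library method not mentioned on the slide, which wraps a string so that all of its characters are treated literally. The pattern "\\y" is used here as an example of an alphabetic escape with no defined meaning; EscapeDemo and its helpers are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EscapeDemo {
    // First substring matched anywhere in the text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    // True if the regex is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "2a*b";
        // "a*" means zero or more a's, so it matches the empty string at index 0.
        System.out.println(firstMatch("a*", text).isEmpty());     // true
        // Escaped star: a literal a followed by a literal *.
        System.out.println(firstMatch("a\\*", text));             // a*
        // Pattern.quote escapes the whole string for us.
        System.out.println(firstMatch(Pattern.quote("a*"), text)); // a*

        System.out.println(compiles("\\y")); // false: undefined alphabetic escape
        System.out.println(compiles("\\@")); // true: non-alphabetic escapes are allowed
    }
}
```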

Spaces

- There is only one thing to be said about spaces (blanks) in regular expressions, but it’s important:
  - Spaces are significant!
- A space stands for a space--when you put a space in a pattern, that means to match a space in the text string.
- It’s a really bad idea to put spaces in a regular expression just to make it look better.

Additions to the String class

- All of the following are public:

    public boolean matches(String regex)
    public String replaceFirst(String regex, String replacement)
    public String replaceAll(String regex, String replacement)
    public String[] split(String regex)
    public String[] split(String regex, int limit)

- If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array’s length will be no greater than n, and the array’s last entry will contain all input beyond the last matched delimiter.
- If n is non-positive then the pattern will be applied as many times as possible.
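
These String methods let you skip the explicit Pattern and Matcher objects for one-off uses. A short sketch follows, with sample inputs invented for illustration; StringRegexDemo and its wrapper method are hypothetical names.

```java
import java.util.Arrays;

public class StringRegexDemo {
    // Thin wrapper so the effect of the limit argument is easy to compare.
    static String[] splitWithLimit(String text, String regex, int limit) {
        return text.split(regex, limit);
    }

    public static void main(String[] args) {
        // matches() must match the entire string.
        System.out.println("time".matches("[a-z]+"));            // true
        System.out.println("Now is the time".matches("[a-z]+")); // false

        System.out.println("a1b22c".replaceFirst("[0-9]+", "-")); // a-b22c
        System.out.println("a1b22c".replaceAll("[0-9]+", "-"));   // a-b-c

        // limit 2: the pattern is applied at most once.
        System.out.println(Arrays.toString(splitWithLimit("a,b,c,d", ",", 2)));  // [a, b,c,d]
        // limit 0 (same as split(regex)): trailing empty strings are dropped.
        System.out.println(Arrays.toString(splitWithLimit("a,b,,", ",", 0)));    // [a, b]
        // negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString(splitWithLimit("a,b,,", ",", -1)));   // [a, b, , ]
    }
}
```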

Thinking in regular expressions

- Regular expressions are not easy to use at first.
  - It’s a bunch of punctuation, not words.
  - The individual pieces are not hard, but it takes practice to learn to put them together correctly.
  - Regular expressions form a miniature programming language.
    - It’s a different kind of programming language than Java, and requires you to learn new thought patterns.
  - In Java you can’t just use a regular expression; you have to first create Patterns and Matchers.
  - Java’s syntax for String constants doesn’t help, either.
- Despite all this, regular expressions bring so much power and convenience to String manipulation that they are well worth the effort of learning.

The End

“A little learning is a dangerous thing;
drink deep, or taste not the Pierian spring:
there shallow draughts intoxicate the brain,
and drinking largely sobers us again.”
--Alexander Pope
