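Regular Expressions
Python supports regular expressions through the re module. The two most commonly used matching functions are match() and search().
- re.match()
Tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
- Basic matching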
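```python
# Import the re module
import re
# The string to match against
content = "Hello 123 4567 World_This is a Regex Demo"
# re.match() takes the regular expression as its first argument and the string to match as its second.
result = re.match(r"^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$", content)
# Print the match object
print(result)
# Output (Python 3.7+ names the class re.Match instead of _sre.SRE_Match):
# <_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
# Print the span of the match
print(result.span())
# Print the matched text
print(result.group())
```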
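As noted above, re.match() returns None when the pattern does not match at the start, and calling .group() on None raises AttributeError. The following is a minimal sketch of checking for that case; the two input strings are illustrative and not part of the original example.

```python
import re

# Illustrative inputs (not from the original example)
at_start = "Hello 123"
not_at_start = "Say Hello 123"

pattern = r"Hello\s\d+"

for text in (at_start, not_at_start):
    result = re.match(pattern, text)
    # re.match() returns None when the pattern does not match at position 0
    if result is not None:
        print(text, "->", result.group())
    else:
        print(text, "-> no match at the start")

matched = re.match(pattern, at_start)
missed = re.match(pattern, not_at_start)
```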
- Generic (wildcard) matching
```python
import re
content = "Hello 123 4567 World_This is a Regex Demo"
result = re.match(r"^Hello.*Demo$", content)
print(result.group())
# Output:
# Hello 123 4567 World_This is a Regex Demo
```
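- Capturing the target with groups
```python
import re
content = "Hello 1234 567 World_This is a Regex Demo"
result = re.match(r"^Hello\s(\d+)\s(\d+)\sWorld.*Demo$", content)
# The number passed to group() selects the text captured by the parenthesized group at that position in the pattern
print(result.group(1))
# 2 selects the text captured by the second pair of parentheses
print(result.group(2))
# Output:
# 1234
# 567
```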
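Beyond group(n), a match object also exposes groups(), and a pattern can give its groups names with the `(?P<name>...)` syntax. The sketch below reuses the string from the example above; the group names `first` and `second` are chosen here purely for illustration.

```python
import re

content = "Hello 1234 567 World_This is a Regex Demo"

# Same pattern as above, with named groups (the names are illustrative)
result = re.match(
    r"^Hello\s(?P<first>\d+)\s(?P<second>\d+)\sWorld.*Demo$", content
)

all_groups = result.groups()      # every captured group, as a tuple
first = result.group("first")     # look up a group by name
second = result.group(2)          # numeric indexes still work
print(all_groups, first, second)
```

Named groups tend to be easier to maintain than numeric indexes once a pattern has more than two or three groups.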
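- Greedy matching
```python
import re
content = "Hello 1234567 World_This is a Regex Demo"
# ".*" matches any number of any characters (except newline) and by default takes as many as it can
result = re.match(r"^He.*(\d+).*Demo$", content)
print(result.group(1))
# Because the first ".*" takes as many characters as it can, (\d+) is left with only the last digit. Output:
# 7
```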
- Non-greedy matching
```python
import re
content = "Hello 1234567 World_This is a Regex Demo"
# Adding "?" after ".*" makes it match as few characters as possible
result = re.match(r"^He.*?(\d+).*Demo$", content)
print(result.group(1))
# So the group captures all the digits:
# 1234567
```
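The difference between the two quantifiers is easiest to see when both run against the same input. A small sketch, using an illustrative HTML-like string that is not from the original examples:

```python
import re

# Illustrative string (not from the original examples)
text = "<b>one</b> and <b>two</b>"

# Greedy: ".*" runs to the LAST "</b>" it can reach
greedy = re.search(r"<b>(.*)</b>", text).group(1)

# Non-greedy: ".*?" stops at the FIRST "</b>"
lazy = re.search(r"<b>(.*?)</b>", text).group(1)

print(greedy)
print(lazy)
```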
- Match flags
```python
import re
content = '''Hello 1234567 World_This
is a Regex Demo'''
# The content string contains a newline, and ".*" does not match newlines, so this result is None
result = re.match(r"^He.*?(\d+).*Demo$", content)
# Passing re.S as the third argument to match() lets "." match newlines as well.
result = re.match(r"^He.*?(\d+).*Demo$", content, re.S)
print(result.group(1))
# Output:
# 1234567
```
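To make the effect of the flag explicit, the sketch below keeps both results instead of overwriting the first one. It uses re.DOTALL, which is the long name of re.S in the standard library; the variable names are illustrative.

```python
import re

content = "Hello 1234567 World_This\nis a Regex Demo"
pattern = r"^He.*?(\d+).*Demo$"

# Without the flag, "." stops at the newline, so "Demo" is never reached
without_flag = re.match(pattern, content)

# re.DOTALL (alias re.S) lets "." match the newline too
with_flag = re.match(pattern, content, re.DOTALL)

print(without_flag)
print(with_flag.group(1))
```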
- Escaping
```python
import re
strings = 'price is $5.00'
# "$" and "." are regex metacharacters and must be escaped with "\" to be matched literally
result = re.match(r"price is \$5\.00", strings, re.S)
print(result.group())
```
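When the literal text comes from a variable, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes automatically. A minimal sketch, where the variable `literal` is an illustrative name:

```python
import re

strings = 'price is $5.00'
literal = "$5.00"   # text we want to find exactly as written

# Unescaped, "$" means end-of-string, so this pattern cannot match
unescaped = re.search(literal, strings)

# re.escape() backslash-escapes the metacharacters for us
escaped_pattern = re.escape(literal)
result = re.search(escaped_pattern, strings)

print(unescaped)
print(result.group())
```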
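Limitation of re.match(): it must match starting from the first character.
Tips: prefer generic patterns, use parentheses to capture the target, prefer non-greedy quantifiers, and pass re.S when the text contains newlines.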
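- re.search()
Scans the string and matches at the first position where the pattern fits, so the match can start anywhere.
```python
import re
content = 'hello world hello aCandy hello 1234567 language orange i will find my way'
# "\w" matches a letter, digit, or underscore; "{6}" repeats the preceding item six times, so "\w{6}" means six "\w" characters
result = re.search(r"hello\s(\w{6})\shello\s(\d+)", content)
print(result.group(1))
print(result.group(2))
# Output:
# aCandy
# 1234567
```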
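Running the same pattern through both functions highlights the difference between them. This sketch reuses the string from the example above with a shorter, illustrative pattern that only fits in the middle of the string:

```python
import re

content = 'hello world hello aCandy hello 1234567 language orange i will find my way'
pattern = r"aCandy\shello\s(\d+)"

# match() only tries position 0, where the string starts with "hello world"
from_start = re.match(pattern, content)

# search() scans forward until the pattern fits somewhere
anywhere = re.search(pattern, content)

print(from_start)
print(anywhere.group(1))
```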
- Matching exercise
```python
import re
content = '''<img class="currentImg" id="currentImg"
onload="alog && alog('speed.set', 'c_firstPageComplete', +new Date);
alog.fire && alog.fire('mark');"
src="https://timgsa.baidu.com/timg?image&quality=80&size=b10000_10000&
sec=1508249402&di=726304faa8ce0b082af59fda8f4fb2fe&
src=http://news.k618.cn/wap/201503/W020150322411690270811.jpg"
width="428.66666666667" height="643" style="top: 35px; left: 199px; width: 512px;
height: 768px; cursor: pointer;" log-rightclick="p=5.102" title="点击查看源网页">'''
result = re.search(r"src=(http:.*\.jpg)", content, re.S)
print(result.group(1))
# Output:
# http://news.k618.cn/wap/201503/W020150322411690270811.jpg
```
If several substrings match the pattern, re.search() returns only the first one.
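The "first match only" behavior can be seen in a short sketch. The input string is illustrative, and re.finditer() is a standard-library function (not covered in the text above) that yields every match object in turn:

```python
import re

# Illustrative string with three candidate matches
text = "id=1 id=22 id=333"

# search() stops at the first match
first = re.search(r"id=(\d+)", text).group(1)

# finditer() yields one match object per occurrence
all_ids = [m.group(1) for m in re.finditer(r"id=(\d+)", text)]

print(first)
print(all_ids)
```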
- re.findall()
Returns every substring that matches the pattern, as a list.
```python
# Import the "requests" module, used to send HTTP requests
import requests
# Import the "re" module for regular expression matching
import re
# The URL to request
url = "http://dbj.99114.com/Corporation/l_%E8%B1%86%E7%93%A3_10855_0_0_1.html"
# Send the request and assign the response to "response"
response = requests.get(url)
# Use re.findall() to collect every match; with two groups, each item is a (title, href) tuple
result = re.findall(r'<h2>.*?<a\stitle="(.*?)".*?href="(.*?)"\starget=.*?</h2>', response.text, re.S)
# Iterate over the matches
for item in result:
    print(item[0] + ':' + item[1])
# Output:
# 郫县会富豆瓣厂:http://shop.99114.com/10991587
# 郫县三桥豆瓣厂:http://shop.99114.com/47923186
# 郫县鑫星豆瓣酿造****:http://shop.99114.com/47916602
# 郫县帅乔酱园厂:http://shop.99114.com/48112509
# 郾城区新新酱料酿造厂:http://shop.99114.com/48114561
# 成都红牌楼川菜调料******:http://shop.99114.com/43816920
# 重庆市永川区佳美调味品****** :http://shop.99114.com/43819140
# 山西晋之源食品**** :http://shop.99114.com/43819241
# 上海味加味食品科技****:http://shop.99114.com/43818232
# 威海市韩味源贸易****:http://shop.99114.com/43817323
# 四川省威远泉威食品******:http://shop.99114.com/43819039
# 锦江区邓氏豆瓣调味品配送经营部:http://shop.99114.com/48124985
# 青岛影都食品****:http://shop.99114.com/43817929
# 成都兆丰和食品****:http://shop.99114.com/48123513
# 成都靓马调味品****:http://shop.99114.com/47900904
# 沂源县养益多食品****:http://shop.99114.com/43818938
# 刘学谦(个体经营):http://shop.99114.com/44984095
# 重庆德康食品**** :http://shop.99114.com/43818131
# 四川省丹丹调味品****:http://shop.99114.com/43815809
# 金牛区亮亮娃食品商贸部:http://shop.99114.com/48109703
# 成都市金福猴食品股份****:http://shop.99114.com/48118601
# 成都保卫食品****:http://shop.99114.com/48373431
# 四川恒信胜商贸****:http://shop.99114.com/48377917
# 重庆十之味食品销售****:http://shop.99114.com/43815708
# 简阳先后食品****:http://shop.99114.com/43817727
# 北京北方丰润贸易****:http://shop.99114.com/43818030
# 成都市郫县红九久调味品****:http://shop.99114.com/48375101
# 通海恩德食品****:http://shop.99114.com/48234357
# 四川先锋生态园调味品****:http://shop.99114.com/14857895
# 郫县七里香调味品厂:http://shop.99114.com/47913051
```
The example above also works as a small web-scraper demo.
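Because the example above depends on a live page, here is an offline sketch of the same findall() pattern. The HTML fragment, shop names, and example.com links are all hypothetical; they only imitate the structure the pattern expects.

```python
import re

# Hypothetical HTML fragment imitating the structure the pattern expects
html = '''
<h2>
  <a title="Shop A" href="http://example.com/a" target="_blank">Shop A</a>
</h2>
<h2>
  <a title="Shop B" href="http://example.com/b" target="_blank">Shop B</a>
</h2>
'''

pattern = r'<h2>.*?<a\stitle="(.*?)".*?href="(.*?)"\starget=.*?</h2>'

# With two capturing groups, findall() returns a list of 2-tuples
result = re.findall(pattern, html, re.S)

for title, link in result:
    print(title + ':' + link)
```

For real pages, an HTML parser is usually more robust than a regular expression, since small markup changes can break the pattern.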
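- re.sub()
Replaces the text matched by a pattern.
```python
import re
content = "hello world hello 123456 language orange"
# The first argument to sub() is the pattern to replace, the second is the replacement, and the third is the string to work on. The result is assigned back to the original variable.
content = re.sub(r"\d+", "SSS", content)
# Print the updated string
print(content)
# Output: the text matched by "\d+" (the digits) is replaced with "SSS"
# hello world hello SSS language orange
content = "hello world hello 123456 language orange"
# Here the pattern is wrapped in "()" to form a group, and the replacement is a raw string (prefix "r") in which "\1" refers to that group, much like group(1)
# So the replacement text is effectively "123456 hhjj"
content = re.sub(r"(\d+)", r'\1 hhjj', content)
print(content)
# Output:
# hello world hello 123456 hhjj language orange
```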
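re.sub() also accepts a function as the replacement and a `count` argument that limits how many matches are replaced. A brief sketch; the input string and the helper `double` are illustrative names, not from the original:

```python
import re

# Illustrative string with two numbers
content = "hello world hello 123456 language orange 42"

def double(match):
    """Return the matched number multiplied by two, as text."""
    return str(int(match.group()) * 2)

# A callable replacement is invoked once per match
doubled = re.sub(r"\d+", double, content)

# count=1 replaces only the first match
only_first = re.sub(r"\d+", "SSS", content, count=1)

print(doubled)
print(only_first)
```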
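- re.compile()
Compiles a regular expression string into a pattern object, so the expression can be reused without being written out again.
```python
import re
# Two strings to match against
content = '''hello world
hello 123456 language_orange'''
test = '''hello Jack hello Rose hello 65665665965 aCandy language'''
# Compile the regular expression into a pattern object so it can be reused
pattern = re.compile(r"hello\s\d+", re.S)
# The pattern object already carries re.S, so do not pass flags again when matching; doing so raises an error.
result = re.search(pattern, content)
# Both searches share the same pattern object, so the expression is not written twice.
result2 = re.search(pattern, test)
print(result.group())
print(result2.group())
# Output:
# hello 123456
# hello 65665665965
```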
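A compiled pattern also has its own search(), match(), findall(), and sub() methods, which is the more common way to use it. The sketch below shows the method form and the error raised when flags are passed alongside a compiled pattern; the variable names are illustrative.

```python
import re

pattern = re.compile(r"hello\s\d+", re.S)
test = "hello Jack hello Rose hello 65665665965 aCandy language"

# Method form: call search() on the pattern object itself
via_method = pattern.search(test).group()

# Passing flags together with an already compiled pattern raises ValueError
try:
    re.search(pattern, test, re.S)
    flag_error = None
except ValueError as exc:
    flag_error = str(exc)

print(via_method)
print(flag_error)
```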