I. Introduction to urllib, urllib2, and requests
1.1 Introduction to urllib
1. urllib.urlopen()
2. urllib.urlretrieve()
urllib.urlretrieve(url, filename=None, reporthook=None, data=None)
1.2 Introduction to urllib2
1. urllib2.urlopen()
2. urllib2.Request()
Example:
import urllib.request
url = 'http://www.baidu.com'
r = urllib.request.urlopen(url)  # send one request to Baidu
print(r.read())
import urllib.request
urllib.request.urlretrieve('https://www.baidu.com/img/flexible/logo/pc/result.png',filename='D:\\baidu.png')
This downloads the Baidu logo and saves it as baidu.png in the root of the D: drive.
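There is no standalone example for urllib2.Request() above; in Python 3 the equivalent class is urllib.request.Request, which lets you attach custom headers before opening the URL. A minimal sketch (the User-Agent string is just an illustrative value):
import urllib.request
url = 'http://www.baidu.com'
# build a Request object so custom headers can be attached to it
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
r = urllib.request.urlopen(req)
print(r.status)        # HTTP status code, e.g. 200
print(r.read()[:200])  # first 200 bytes of the response body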
Note: in Python 3.x, urllib and urllib2 have been merged into a single package, urllib.
Reference: https://blog.csdn.net/weixin_45598506/article/details/112303268
1.3 Introduction to requests
1. Installation
pip install requests
2. Sending network requests
r = requests.get('http://www.ichunqiu.com')
r = requests.post('http://www.ichunqiu.com')
r = requests.put('http://www.ichunqiu.com')
r = requests.delete('http://www.ichunqiu.com')
r = requests.head('http://www.ichunqiu.com')
r = requests.options('http://www.ichunqiu.com')
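Each of these calls returns a Response object. As a quick check, the sketch below (assuming httpbin.org is reachable) issues two of the verbs and prints the status codes:
import requests
r = requests.get('http://httpbin.org/get')
print(r.status_code)   # 200
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)   # 200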
3. Passing parameters in a URL
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
Output: http://httpbin.org/get?key1=value1&key2=value2
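A parameter can also take a list of values, in which case requests repeats the key once per value; a minimal sketch:
import requests
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)  # http://httpbin.org/get?key1=value1&key2=value2&key2=value3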
4. Response content
r = requests.get('http://www.ichunqiu.com')
r.text                    # response body decoded as text
r.encoding                # 'utf-8'
r.encoding = 'ISO-8859-1'
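requests guesses the encoding from the response headers, and assigning to r.encoding changes how r.text is decoded on the next access; a small sketch:
import requests
r = requests.get('http://www.ichunqiu.com')
print(r.encoding)      # encoding guessed from the response headers
r.encoding = 'utf-8'   # override it if the guess is wrong for this page
print(r.text[:100])    # r.text is now decoded with the new encoding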
5. Binary response content
r = requests.get('http://www.ichunqiu.com')
r.content
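As a concrete use of r.content, the sketch below re-downloads the Baidu logo from the earlier urlretrieve example with requests and writes the raw bytes to disk (the local filename is an illustrative choice):
import requests
url = 'https://www.baidu.com/img/flexible/logo/pc/result.png'
r = requests.get(url)
# r.content holds the raw bytes of the body, which is what an image file needs
with open('baidu.png', 'wb') as f:
    f.write(r.content)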
6. Customizing request headers
url = 'http://www.ichunqiu.com'
headers = {'content-type': 'application/json'}
r = requests.get(url, headers=headers)
Note: cookies can also be sent by adding a Cookie header here.
7. More complex POST requests
payload = ("key1 “: ‘value1’ , “key2”: ‘value2’”}
r = requests.post(“http://httpbin.org/post” , data=payload)
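httpbin.org echoes the form fields it receives, so the POST body can be verified with r.json(), the built-in JSON decoder on the Response object; a minimal sketch:
import requests
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)
# httpbin echoes the submitted form fields under the 'form' key
print(r.json()['form'])  # {'key1': 'value1', 'key2': 'value2'}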
8. Response status codes
r = requests.get('http://www.ichunqiu.com')
r.status_code
Output: 200
9. Response headers
r.headers
10. Cookies
r.cookies
r.cookies['example_cookie_name']
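Cookies can also be sent with a request through the cookies parameter; the name and value below are illustrative, and httpbin.org/cookies simply echoes what it receives:
import requests
cookies = {'session_id': 'abc123'}  # illustrative cookie name/value
r = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(r.json())  # {'cookies': {'session_id': 'abc123'}}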
11. Timeouts
requests.get('http://www.ichunqiu.com', timeout=0.001)
12. Errors and exceptions
On a network problem (e.g. DNS failure or a refused connection), requests raises a ConnectionError exception.
On a rare invalid HTTP response, requests raises an HTTPError exception.
If a request times out, a Timeout exception is raised.
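All three classes live under requests.exceptions, so a crawler can catch them individually; a minimal sketch (the 3-second timeout is an arbitrary choice):
import requests
url = 'http://www.ichunqiu.com'
try:
    r = requests.get(url, timeout=3)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.ConnectionError:
    print('network problem (DNS failure, refused connection, ...)')
except requests.exceptions.HTTPError as e:
    print('invalid HTTP response:', e)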
Example 1:
import requests
url = 'http://www.baidu.com'
r = requests.get(url)  # send one request to Baidu
print(r.text)
print(r.content)  # binary response content
Example 2:
import requests
url = 'http://www.baidu.com'
#r = requests.get(url)        # send one request to Baidu
#print(r.text)
#print(r.content)             # binary response content
#print(r.status_code)         # status code, e.g. 200 or 404
#print(r.headers)             # response headers
#print(r.cookies)             # cookies set by Baidu
# setting a request timeout helps keep a crawler efficient
r = requests.get(url, timeout=0.001)
Example 3:
import requests
url = 'http://www.baidu.com'
r = requests.get(url, timeout=0.001)
# Baidu cannot respond within 0.001 s, so the request times out and raises an exception:
#requests.exceptions.ReadTimeout: HTTPConnectionPool(host='www.baidu.com', port=80): Read timed out. (read timeout=0.001)
II. Introduction to web crawlers
A web crawler (also known as a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
The biggest benefit of a crawler is that it can collect and process information in bulk, automatically, giving you one more angle on a situation at either the macro or the micro level.
III. Developing a crawler in Python
Example 1: read the result value inside course from the JSON response
# coding=utf-8
import requests
import json
url='https://www.ichunqiu.com/courses/ajaxCourses'
payload = 'courseTag=&courseDiffcuty=1&IsExp=&producerId=&orderField=&orderDirection=&pageIndex=5&tagType=&isOpen='
# copy a real browser's User-Agent; the default requests UA is identified as a crawler and blocked by the WAF
headers = {
    'Host': 'www.ichunqiu.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
}
r = requests.post(url, headers=headers, data=payload)
data = json.loads(r.text)
print(data['course'])            # the course object in the JSON response
print(data['course']['result'])  # the result list inside course
Example 2: crawl all the course names and print them
# coding=utf-8
import requests
import json
payload_start = 'courseTag=&courseDiffcuty=1&IsExp=&producerId=&orderField=&orderDirection=&pageIndex='
def lesson(payload):
    url = 'https://www.ichunqiu.com/courses/ajaxCourses'
    #payload = 'courseTag=&courseDiffcuty=1&IsExp=&producerId=&orderField=&orderDirection=&pageIndex='
    # copy a real browser's headers; the default requests UA is identified as a crawler and blocked by the WAF
    headers = {
        'Cookie': 'ci_session=ea12fe98d0b99f9cfc7de37d51e34805ec566686; chkphone=acWxNpxhQpDiAchhNuSnEqyiQuDIO0O0O; __jsluid_s=e09df2c83e087903c72b4d33caca7c93; Hm_lvt_2d0601bd28de7d49818249cf35d95943=1662127088; Hm_lpvt_2d0601bd28de7d49818249cf35d95943=1662128935',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Length': '103',  # matches the payload length for single-digit page numbers; requests would set this automatically
        'Origin': 'https://www.ichunqiu.com',
        'Referer': 'https://www.ichunqiu.com/courses/nandu-chu',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'Te': 'trailers',
    }
    r = requests.post(url, headers=headers, data=payload)
    data = json.loads(r.text)
    name_long = int(data['course']['perPageSize'])  # number of course names per page
    #name_long = len(data['course']['result'])  # alternative: count how many result entries the page actually has
    #print(name_long)
    #print(data['course']['result'][0]['courseName'])  # read the first courseName inside result
    for i in range(name_long):
        print(data['course']['result'][i]['courseName'])

# use a for loop to fetch the course names from 8 pages (the site has 12 pages of courses in total)
for i in range(1, 9):
    payload = payload_start + str(i) + '&tagType=&isOpen='
    lesson(payload)
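One possible refinement, not part of the original script: use a requests.Session so the common headers are set once and the underlying TCP connection is reused across the 8 page requests. A sketch under those assumptions:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
})

url = 'https://www.ichunqiu.com/courses/ajaxCourses'
payload_start = 'courseTag=&courseDiffcuty=1&IsExp=&producerId=&orderField=&orderDirection=&pageIndex='
for i in range(1, 9):
    payload = payload_start + str(i) + '&tagType=&isOpen='
    data = session.post(url, data=payload).json()  # Response.json() parses the JSON body
    for course in data['course']['result']:
        print(course['courseName'])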