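At a friend's request, I scraped some information from a public page, and I am writing it down here. The code is fairly simple, so this is just a memo. The main point is that the request uses the payload type.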
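The User-Agent and URL have been removed because they are not suitable to show. There is really only one tricky part: fetching data from an endpoint that expects a Request Payload, that is, a JSON body rather than form fields. I had never done this before and I know little about front-end work. Most projects today separate the front end from the back end, so I rarely get to do front-end/back-end integration. I probably won't need this often in the future either, so I am recording it for the next time it comes up.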
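To make the difference concrete, here is a minimal sketch using only the standard library. It assumes the same payload fields used in this post and shows roughly what the request body looks like in each case: form-encoded (what you get when a dict is passed as `data=`) versus a JSON string (what a Request Payload endpoint expects). The variable names `form_body` and `json_body` are only for illustration.

```python
import json
from urllib.parse import urlencode

# Same fields as the payload in this post.
payload = {"pageNo": 1, "pageSize": 100, "keyword": "工程", "orderByType": 5}

# "Form Data" style body: key=value pairs joined with '&', percent-encoded.
form_body = urlencode(payload)

# "Request Payload" style body: a single JSON document.
# json.dumps escapes non-ASCII characters by default.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

If the developer tools show "Request Payload" rather than "Form Data", the server is most likely parsing the body as JSON, so the second form is the one to send, together with `Content-Type: application/json`.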
import json
import time

import requests


def get_phonenum(num):
    # The real URL and User-Agent are intentionally left out; fill them in before running.
    url = ''
    headers = {
        'Host': 'holmes.taobao.com',
        'Connection': 'keep-alive',
        # Content-Length is omitted on purpose: requests computes it from the body,
        # and a hard-coded value breaks as soon as the payload size changes.
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        'Accept': 'application/json, text/plain',
        'Content-Type': 'application/json',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'Origin': 'https://www.dingtalk.com',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.dingtalk.com/',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    # Request Payload, copied from the browser's developer tools.
    payload = {
        "pageNo": int(num),
        "pageSize": 100,
        "keyword": "工程",  # search keyword
        "orderByType": 5
    }
    # The payload goes in the body as a JSON string, not as form fields.
    response = requests.post(url=url, data=json.dumps(payload), headers=headers)
    data = response.json()['data']['data']
    # Append one tab-separated line per record; the file is closed automatically.
    with open('./value.txt', 'a+', encoding='utf-8') as file:
        for row in data:
            values = '\t'.join(str(value) for value in row.values())
            file.write(values + '\n')


if __name__ == '__main__':
    for i in range(1, 1100):
        get_phonenum(i)
        print('\rFetching page %d' % i, end='')
        time.sleep(1)  # pause between pages
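As an alternative to building each output line by hand, the standard `csv` module can write tab-separated rows and a header line taken from the record keys. The sketch below is only an illustration: the helper name `rows_to_tsv` and the sample records (with fields `name` and `city`) are made up, since the real response fields are not shown in this post.

```python
import csv
import io


def rows_to_tsv(records, out, write_header=True):
    """Write a list of dicts to the file-like object `out` as tab-separated lines.

    Returns the number of records written.
    """
    if not records:
        return 0
    # Use the keys of the first record as the column order.
    fieldnames = list(records[0].keys())
    writer = csv.DictWriter(
        out,
        fieldnames=fieldnames,
        delimiter='\t',
        extrasaction='ignore',  # skip keys that the first record did not have
        restval='',             # fill missing keys with an empty string
        lineterminator='\n',
    )
    if write_header:
        writer.writeheader()
    writer.writerows(records)
    return len(records)


# Hypothetical records standing in for response.json()['data']['data'].
sample = [
    {"name": "Example Co., Ltd", "city": "Hangzhou"},
    {"name": "Another Org", "city": "Beijing"},
]
buffer = io.StringIO()
count = rows_to_tsv(sample, buffer)
print(buffer.getvalue())
```

With this approach a comma inside a value stays intact, and a value that itself contains a tab or newline is quoted by the writer instead of silently shifting columns. When appending page after page to one file, `write_header` would be set to `True` only for the first page.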
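When you fetch data from an API, try to obtain it through legitimate channels. Don't try to use the interface to dig into someone else's database, or you will run into a lot of trouble.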