Scraping Web Page Data

PT、小小马

License: CC 4.0 BY-SA

Tags: python, programming languages

Article link: https://blog.csdn.net/qq_44862918/article/details/123539246

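This post records a piece of Python crawler code used to fetch information from a public page. The code defines a get_phonenum function that sends a POST request with payload data containing the page number, page size, keyword, and sort order. Once the data is fetched, the results are written to a file. The post also reminds readers to obtain API data legally and compliantly.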

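At a friend's request, I scraped information from a public page and am writing it down here. The code is fairly simple and serves as a memo; the main point is how to handle a payload-type request body.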

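The User-Agent and URL have been removed because they are not suitable to show. There is really only one difficulty: fetching data through a payload request. I had not used it before and know relatively little about front-end development. Although most projects today separate front end and back end, I rarely get the chance to do front-end/back-end integration debugging. I will probably not need this often either, so I am recording it for the next time.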
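As a small illustration of what "payload type" means here, the sketch below builds the same parameters as a form-encoded body and as a JSON body, using only the standard library. In requests, passing a dict to `data=` produces the first form, while `json=` or `data=json.dumps(...)` produces the second. The parameter values are taken from the code in this post; nothing is sent over the network.

```python
import json
from urllib.parse import urlencode

payload = {'pageNo': 1, 'pageSize': 100, 'keyword': '工程', 'orderByType': 5}

# Form-encoded body: key=value pairs joined by "&", non-ASCII percent-encoded.
# Browser dev tools usually list this under "Form Data".
form_body = urlencode(payload)

# JSON body: the serialized dict, usually listed under "Request Payload".
# An endpoint that expects JSON will generally not accept the form version.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

If the browser's network panel shows "Request Payload" for a call, that is a hint that the JSON form, together with a `Content-Type: application/json` header, is the one to reproduce.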

import json
import time

import requests


def get_phonenum(num):
    """Fetch one page of results and append each record to value.txt."""
    url = ''  # removed by the author on purpose
    headers = {
        'Host': 'holmes.taobao.com',
        'Connection': 'keep-alive',
        # Content-Length is left out; requests sets it from the actual body.
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        'Accept': 'application/json, text/plain',
        'Content-Type': 'application/json',
        'sec-ch-ua-mobile': '?0',
        # The User-Agent header was removed by the author on purpose.
        'sec-ch-ua-platform': '"Windows"',
        'Origin': 'https://www.dingtalk.com',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.dingtalk.com/',
        # Decoding 'br' responses requires the brotli package to be installed.
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    payload = {
        'pageNo': int(num),
        'pageSize': 100,
        'keyword': '工程',
        'orderByType': 5
    }
    # The body is sent as a JSON string (a request payload), not as form data.
    response = requests.post(url=url, data=json.dumps(payload), headers=headers)
    data = response.json()['data']['data']
    # The with block closes the file even if writing a record fails.
    with open('./value.txt', 'a+', encoding='utf-8') as file:
        for row in data:
            # One record per line, field values separated by tabs.
            # To write the column names instead, join row.keys() the same way.
            values = '\t'.join(str(value) for value in row.values())
            file.write(values + '\n')


if __name__ == '__main__':
    for i in range(1, 1100):
        get_phonenum(i)
        print('\rFetching page %d' % i, end='')
        time.sleep(1)  # pause between requests
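The loop above always requests a fixed range of pages. One possible variation, sketched below, stops as soon as a page comes back empty and writes a header row once through the csv module. Here `fetch_page` is a hypothetical callable standing in for the POST request, the sample records are made up for the demonstration, and the idea that the real interface returns an empty list past the last page is an assumption rather than something the post confirms.

```python
import csv
import io


def scrape_all(fetch_page, out, max_pages=1099):
    """Write the records from fetch_page(1), fetch_page(2), ... to out as TSV.

    fetch_page is a hypothetical callable returning a list of dicts for a
    page number. The loop ends at the first empty page or after max_pages.
    """
    writer = None
    total = 0
    for page_no in range(1, max_pages + 1):
        rows = fetch_page(page_no)
        if not rows:
            break  # assumed signal that there is no more data
        if writer is None:
            # The first record's keys define the column order and header row.
            writer = csv.DictWriter(out, fieldnames=list(rows[0]),
                                    delimiter='\t', lineterminator='\n',
                                    restval='', extrasaction='ignore')
            writer.writeheader()
        writer.writerows(rows)
        total += len(rows)
    return total


# Stand-in for the real request: two pages of made-up records, then nothing.
fake_pages = {1: [{'name': 'a', 'city': 'x'}], 2: [{'name': 'b', 'city': 'y'}]}
buffer = io.StringIO()
count = scrape_all(lambda n: fake_pages.get(n, []), buffer)
print(count)
print(buffer.getvalue())
```

Passing the fetch function in as an argument keeps the paging and writing logic testable without touching the network; in real use it would wrap the request and a `time.sleep` between pages.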

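When fetching data from an API, get it through legitimate channels wherever possible. Do not try to use the interface to dig into someone else's database; doing so will cause a lot of trouble.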