Scraping Zhihu Comments with Python: Performance Optimization with Multithreaded and Asynchronous Crawlers

2025-07-08

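Technical challenges of scraping Zhihu comments
Zhihu usually loads comment data dynamically (via Ajax), so fetching the page with requests + BeautifulSoup alone does not return the complete data. Zhihu also applies anti-scraping measures, including:
● Request header validation (such as User-Agent and Referer)
● Cookie/Session checks (users who are not logged in can only get part of the data)
● Rate limiting (frequent requests may get your IP banned)
We therefore need to:
● Simulate browser requests (send Headers and Cookies)
● Call the dynamic API endpoints (rather than parse static HTML)
● Speed up the crawl (multithreading/async)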
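The first requirement can be sketched as a small helper that builds browser-like headers. The function name build_headers, the default Referer value, and the placeholder cookie string are assumptions for illustration; the real values should be copied from your own logged-in browser session.

```python
def build_headers(cookie, referer="https://www.zhihu.com/"):
    """Return request headers that resemble a logged-in browser session."""
    return {
        # Placeholder values; copy the real ones from the browser's Network tab.
        "User-Agent": "Mozilla/5.0",
        "Referer": referer,
        "Cookie": cookie,
    }

headers = build_headers("your Cookie")
print(sorted(headers))
```

The resulting dictionary can be passed as the headers argument of requests.get, or set once on a requests.Session so that every call carries it.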
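Analyzing the Zhihu comment API
(1) Finding the comment API
Open any Zhihu question (for example https://www.zhihu.com/question/xxxxxx), press F12 to open the developer tools, switch to the Network tab, and filter by XHR requests to find the request that returns the comments.
(2) Understanding the comment data structure
Comments are usually nested in the data field, structured as follows: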
```
{
  "data": [
    {
      "content": "comment text",
      "author": { "name": "username" },
      "created_time": 1620000000
    }
  ],
  "paging": { "is_end": false, "next": "URL of the next page" }
}
```
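As a minimal sketch of reading this structure, the function below flattens each entry into a plain dictionary and converts created_time from a Unix timestamp to a UTC datetime. The name parse_comments and the sample payload are hypothetical; the real response may contain additional fields.

```python
from datetime import datetime, timezone

def parse_comments(payload):
    """Extract content, author name, and creation time from one API page."""
    rows = []
    for item in payload.get("data", []):
        rows.append({
            "content": item["content"],
            "author": item["author"]["name"],
            # created_time is a Unix timestamp in seconds
            "created": datetime.fromtimestamp(item["created_time"], tz=timezone.utc),
        })
    return rows

# Hypothetical sample mirroring the structure shown above
sample = {
    "data": [
        {"content": "comment text", "author": {"name": "username"}, "created_time": 1620000000}
    ],
    "paging": {"is_end": True, "next": ""},
}
rows = parse_comments(sample)
print(rows[0]["created"].isoformat())  # 2021-05-03T00:00:00+00:00
```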
To collect all comments, we follow paging.next page after page until paging.is_end is true.
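The paging logic might look like the sketch below. It uses a loop rather than recursion, and it takes the HTTP call as a parameter (fetch_json) so the logic can be tried without touching the network. The names iter_all_comments, fetch_json, and max_pages, as well as the fake pages, are assumptions for illustration.

```python
def iter_all_comments(first_url, fetch_json, max_pages=100):
    """Yield items page by page, following paging.next until is_end is true."""
    url = first_url
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        payload = fetch_json(url)
        yield from payload.get("data", [])
        paging = payload.get("paging", {})
        if paging.get("is_end", True) or not paging.get("next"):
            break
        url = paging["next"]

# Fake two-page response standing in for the real API
fake_pages = {
    "page1": {"data": [{"content": "a"}, {"content": "b"}],
              "paging": {"is_end": False, "next": "page2"}},
    "page2": {"data": [{"content": "c"}],
              "paging": {"is_end": True, "next": ""}},
}
all_items = list(iter_all_comments("page1", fake_pages.__getitem__))
print([item["content"] for item in all_items])  # ['a', 'b', 'c']
```

In a real crawl, fetch_json could be something like lambda url: requests.get(url, headers=headers).json().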
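Three ways to scrape Zhihu comments with Python
(1) Single-threaded crawler (baseline)
Use the requests library to call the API directly, one page at a time: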
```python
import requests
import time

def fetch_comments(question_id, max_pages=5):
    base_url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "your Cookie",  # obtained after logging in
    }
    comments = []
    for page in range(max_pages):
        # Each page holds 10 items; offset selects the page
        url = f"{base_url}?offset={page * 10}&limit=10"
        resp = requests.get(url, headers=headers).json()
        for answer in resp["data"]:
            comments.append(answer["content"])
        time.sleep(1)  # avoid sending requests too quickly
    return comments

start_time = time.time()
comments = fetch_comments("12345678")  # replace with a Zhihu question ID
print(f"Single-threaded crawl finished in {time.time() - start_time:.2f} seconds")
```

Drawback: pages are requested one at a time, so the crawl is slow (at roughly 1 second per page, 10 pages take about 10 seconds).
(2) Multithreaded crawler (ThreadPoolExecutor)
Use concurrent.futures to send requests concurrently from multiple threads:
```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_page(page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers).json()
    return [answer["content"] for answer in resp["data"]]

def fetch_comments_multi(question_id, max_pages=5, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        # Submit one task per page; the pool runs at most `threads` at a time
        futures = [executor.submit(fetch_page, page, question_id) for page in range(max_pages)]
        comments = []
        for future in futures:  # collect results in page order
            comments.extend(future.result())
    return comments

start_time = time.time()
comments = fetch_comments_multi("12345678", threads=4)
print(f"Multithreaded crawl finished in {time.time() - start_time:.2f} seconds")
```

Improvements:
● The thread pool caps the number of concurrent requests (to avoid getting banned)
● About 3-4 times faster than the single-threaded version (with 4 threads, 10 pages take only about 2-3 seconds)
(3) Asynchronous crawler (asyncio + aiohttp)
Use aiohttp to send asynchronous HTTP requests and improve throughput further:
```python
import aiohttp
import asyncio
import time

# Proxy configuration (replace with your own proxy host and credentials)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"
proxyUrl = f"http://{proxyHost}:{proxyPort}"

async def fetch_page_async(session, page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    # aiohttp takes the proxy per request; TCPConnector has no proxy argument
    proxy_auth = aiohttp.BasicAuth(proxyUser, proxyPass)
    async with session.get(url, headers=headers, proxy=proxyUrl, proxy_auth=proxy_auth) as resp:
        data = await resp.json()
        return [answer["content"] for answer in data["data"]]

async def fetch_comments_async(question_id, max_pages=5):
    connector = aiohttp.TCPConnector(
        limit=20,  # maximum number of concurrent connections
        force_close=True,
        enable_cleanup_closed=True,
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page_async(session, page, question_id) for page in range(max_pages)]
        comments = await asyncio.gather(*tasks)
    # Flatten the per-page lists into a single list
    return [item for sublist in comments for item in sublist]

if __name__ == "__main__":
    start_time = time.time()
    comments = asyncio.run(fetch_comments_async("12345678"))  # replace with a Zhihu question ID
    print(f"Async crawl finished in {time.time() - start_time:.2f} seconds")
    print(f"Fetched {len(comments)} comments in total")
```
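Advantages:
● All requests run on a single-threaded event loop, so there is no thread-switching or GIL contention overhead, which makes it more efficient than a thread pool at high concurrency
● Well suited to highly concurrent, IO-bound tasks (such as crawling)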

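Performance comparison and optimization tips

| Approach | Time for 10 pages (seconds) | Best suited for |
| --- | --- | --- |
| Single-threaded | ~10 | Small amounts of data, simple crawls |
| Multithreaded (4 threads) | ~2.5 | Medium scale, where concurrency must be controlled |
| Async (asyncio) | ~1.8 | Large-scale crawls with high concurrency |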
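The figures above are the article's estimates. To see the same ordering without sending any requests to Zhihu, the sketch below simulates network latency with sleep calls and times the three approaches. PAGES, DELAY, and all function names are assumptions for illustration, and the absolute numbers will differ from a real crawl.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

PAGES = 10
DELAY = 0.05  # simulated latency per page in seconds (an assumption)

def fake_fetch(page):
    time.sleep(DELAY)  # stands in for a blocking HTTP request
    return page

async def fake_fetch_async(page):
    await asyncio.sleep(DELAY)  # stands in for a non-blocking HTTP request
    return page

def run_sequential():
    return [fake_fetch(p) for p in range(PAGES)]

def run_threads():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_fetch, range(PAGES)))

def run_async():
    async def main():
        return await asyncio.gather(*(fake_fetch_async(p) for p in range(PAGES)))
    return asyncio.run(main())

def timed(fn):
    """Return the wall-clock time taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

t_seq, t_thr, t_async = timed(run_sequential), timed(run_threads), timed(run_async)
print(f"sequential {t_seq:.2f}s, threads {t_thr:.2f}s, asyncio {t_async:.2f}s")
```

With 4 workers the 10 simulated pages need three rounds of waiting, while the asyncio version waits for all pages at once, which is why it comes out ahead in this idealized setting.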
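Optimization tips
● Control concurrency: avoid triggering anti-scraping measures (10-20 concurrent requests is a reasonable range).
● Random delays: time.sleep(random.uniform(0.5, 2)) mimics human behavior (in async code, use await asyncio.sleep(...) instead so the event loop is not blocked).
● Proxy IP pool: prevents your IP from being banned (for example, requests + ProxyPool).
● Storage: write to the database asynchronously (such as MongoDB or MySQL).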
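The first two tips can be combined in the async crawler roughly as follows: an asyncio.Semaphore caps the number of requests in flight, and each request waits a random interval first. This is a sketch under stated assumptions; polite_fetch, crawl, and fake_fetch are hypothetical names, and fake_fetch stands in for the real aiohttp call.

```python
import asyncio
import random

async def polite_fetch(page, semaphore, fetch, min_delay=0.5, max_delay=2.0):
    """Fetch one page while respecting the concurrency cap and a random delay."""
    async with semaphore:  # at most max_concurrency coroutines pass this point
        await asyncio.sleep(random.uniform(min_delay, max_delay))
        return await fetch(page)

async def crawl(pages, fetch, max_concurrency=10, **delays):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [polite_fetch(p, semaphore, fetch, **delays) for p in pages]
    return await asyncio.gather(*tasks)  # results keep the order of `pages`

async def fake_fetch(page):
    # Placeholder for a real request such as session.get(...)
    return f"page-{page}"

results = asyncio.run(
    crawl(range(5), fake_fetch, max_concurrency=2, min_delay=0.0, max_delay=0.01)
)
print(results)
```

The short delays in the usage line only keep the demonstration fast; for a real crawl the defaults of 0.5 to 2 seconds match the range suggested above.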
