A Practical Guide to Web Scraping with Crawlee-Python

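Introduction

Web scraping has become a common way to collect data from the internet. This article walks through using Crawlee-Python, a Python crawling framework, to extract structured product data from an e-commerce website.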

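Defining the Scraping Targets

Before writing any code, we need to decide which fields to scrape. Taking an electronics store as the example, we plan to collect the following:

Product URL
Manufacturer name
Product SKU
Product title
Current price
Stock status

Together these fields form a product record that can feed price monitoring, inventory analysis, and similar business intelligence uses.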

Basic Extraction Techniques

Extracting the Manufacturer from the URL

Product URLs often carry useful information. For example: https://example.com/products/sennheiser-mke-440-professional-microphone

The manufacturer name can be pulled out with plain string handling:

url = "https://example.com/products/sennheiser-mke-440-professional-microphone"
manufacturer = url.split('/')[-1].split('-')[0]
# Result: 'sennheiser'

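Key points:

split('/') splits the URL on slashes
[-1] takes the last path segment
split('-')[0] takes everything before the first hyphen

Caveats:

This approach breaks when the manufacturer name itself contains a hyphen
Consider a regular expression, or reading the manufacturer from the detail page, as a fallback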
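
One way to guard against the hyphenated-name caveat is to match the slug against a list of known manufacturers before falling back to the first-hyphen heuristic. The following is only a sketch under that assumption: `KNOWN_MANUFACTURERS` and `extract_manufacturer` are hypothetical names introduced here, and the brand list is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical list of known brands; multi-word brands are written with hyphens.
KNOWN_MANUFACTURERS = ("audio-technica", "sennheiser", "sony")


def extract_manufacturer(url: str) -> str:
    """Return the manufacturer encoded at the start of a product URL slug."""
    # urlparse separates the query string and fragment, so only the path is split.
    slug = urlparse(url).path.rstrip("/").split("/")[-1].lower()
    # Try longer names first so a multi-word brand wins over a shorter prefix.
    for name in sorted(KNOWN_MANUFACTURERS, key=len, reverse=True):
        if slug == name or slug.startswith(name + "-"):
            return name
    # Fall back to the original heuristic: the text before the first hyphen.
    return slug.split("-")[0]
```

Using `urlparse` rather than splitting the raw URL also keeps a trailing slash or query string from ending up in the result.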

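Locating Elements with CSS Selectors

Extracting the Product Title

The product title usually sits inside an <h1> tag. After inspecting the element structure in the browser developer tools, we can write a precise selector: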

title = await context.page.locator('.product-meta h1').text_content()

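How the selector works:

.product-meta h1 matches every h1 that is a descendant of an element with the class product-meta
text_content() returns the text content of the element

Debugging tips:

In the Elements panel of the developer tools, press Ctrl+F to test a CSS selector
Make sure the selector matches exactly one element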

Extracting the SKU

The SKU usually sits in a span element with a specific class:

sku = await context.page.locator('span.product-meta__sku-number').text_content()
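
The title and SKU lookups above assume that the selector matches and that the text comes back clean. As a hedged sketch, a small helper can check that exactly one element matches and strip surrounding whitespace. `read_unique_text` is a hypothetical name introduced here; it relies only on the `count()` and `text_content()` methods that Playwright locators provide.

```python
from typing import Any, Optional


async def read_unique_text(locator: Any) -> Optional[str]:
    """Return stripped text when the locator matches exactly one element."""
    # count() reports the current number of matches without waiting for one to appear.
    matches = await locator.count()
    if matches != 1:
        return None
    text = await locator.text_content()
    # text_content() may return None; treat missing or blank text as no value.
    cleaned = (text or "").strip()
    return cleaned or None
```

Inside a handler it could be called as `title = await read_unique_text(context.page.locator('.product-meta h1'))`, with a `None` result signalling that the selector needs another look.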

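Handling More Complex Data

Extracting the Price

The price element usually contains extra text and formatting characters, so it needs some cleanup: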

price_element = context.page.locator('span.price', has_text='$').first
current_price_string = await price_element.text_content() or ''
raw_price = current_price_string.split('$')[1]
price = float(raw_price.replace(',', ''))

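Processing steps:

Locate the price element that contains the $ sign
Read its text content (for example "Sale price$1,398.00")
Split the string to get the numeric part
Remove the commas and convert to a float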
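
The split-based steps above raise an IndexError when the text contains no $ sign, which is what happens if text_content() returns None and the empty-string fallback is used. A more defensive variant, sketched here with a regular expression, returns None in that case. `parse_price` is a hypothetical helper name, and using `Decimal` instead of `float` is a substitution made for this sketch to avoid binary rounding of currency values, not something the steps above require.

```python
import re
from decimal import Decimal
from typing import Optional

# A dollar sign followed by digits, optional thousands separators, and optional cents.
_PRICE_PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def parse_price(text: str) -> Optional[Decimal]:
    """Return the first dollar amount found in text, or None when there is none."""
    match = _PRICE_PATTERN.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))
```

If a plain float is needed downstream, `float(parse_price(...))` converts the result once the None case has been handled.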

Detecting Stock Status

Stock information is usually indicated by specific text:

in_stock_element = context.page.locator(
    selector='span.product-form__inventory',
    has_text='In stock',
).first
in_stock = await in_stock_element.count() > 0

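How it works:

Look for an element containing the text "In stock"
Use count() to check whether such an element exists
The comparison yields a boolean representing the stock status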
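
The count-based check treats "element missing" and "out of stock" the same way. If it matters to tell a sold-out product apart from a page whose layout has changed, one option is to read the inventory label text and classify it explicitly. This is a sketch only: `stock_status` is a hypothetical helper, and the "Sold out" label is an assumption about what the site shows, not something confirmed above.

```python
from typing import Optional


def stock_status(label: Optional[str]) -> Optional[bool]:
    """Map an inventory label to True, False, or None when it is unrecognized."""
    if not label:
        return None
    # Collapse runs of whitespace and ignore case before comparing.
    normalized = " ".join(label.split()).lower()
    if normalized.startswith("in stock"):
        return True
    # "sold out" is an assumed out-of-stock label, not taken from the article.
    if normalized.startswith("sold out"):
        return False
    return None
```

A None result can then be logged or counted separately, so selector breakage shows up instead of silently being recorded as out of stock.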

Complete Example

async def handle_product_page(context):
    # Derive the manufacturer from the last segment of the URL
    url_part = context.request.url.split('/').pop()
    manufacturer = url_part.split('-')[0]

    # Extract the product title
    title = await context.page.locator('.product-meta h1').text_content()

    # Extract the SKU
    sku = await context.page.locator('span.product-meta__sku-number').text_content()

    # Extract the price, e.g. "Sale price$1,398.00" -> 1398.0
    price_element = context.page.locator('span.price', has_text='$').first
    current_price_string = await price_element.text_content() or ''
    raw_price = current_price_string.split('$')[1]
    price = float(raw_price.replace(',', ''))

    # Check the stock status
    in_stock_element = context.page.locator(
        'span.product-form__inventory',
        has_text='In stock',
    ).first
    in_stock = await in_stock_element.count() > 0

    # Return the structured data
    return {
        'url': context.request.url,
        'manufacturer': manufacturer,
        'title': title,
        'sku': sku,
        'price': price,
        'in_stock': in_stock
    }

Sample Output

Running the crawler produces formatted JSON data:

{
    "url": "https://example.com/products/sony-receiver",
    "manufacturer": "sony",
    "title": "Sony STR-ZA810ES Receiver",
    "sku": "SON-692802-STR-DE",
    "price": 698.0,
    "in_stock": true
}
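
For reference, this is roughly how a Python dictionary with these fields turns into the JSON shown above when serialized with the standard library. It is a minimal sketch: `to_json` is a hypothetical helper, and how the crawler actually stores its results is not covered here.

```python
import json


def to_json(record: dict) -> str:
    """Serialize one scraped record as indented JSON."""
    # ensure_ascii=False keeps non-ASCII product titles readable in the output.
    return json.dumps(record, indent=4, ensure_ascii=False)


record = {
    "url": "https://example.com/products/sony-receiver",
    "manufacturer": "sony",
    "title": "Sony STR-ZA810ES Receiver",
    "sku": "SON-692802-STR-DE",
    "price": 698.0,
    "in_stock": True,
}
print(to_json(record))
```

Note that Python's True is written as the JSON literal true, and the float 698.0 keeps its decimal point.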

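Best Practices

Avoid redundancy: decide whether values that can be derived from the URL need to be stored in the results at all
Error handling: wrap each extraction step in exception handling
Selector choice: prefer stable attributes such as class and id
Data validation: check the format of the extracted values
Performance: avoid unnecessary DOM operations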
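
As an illustration of the data validation point, a record can be checked against a few simple rules before it is stored. The sketch below is one possible shape, not a prescribed one: the `Product` dataclass and the `validate` function are hypothetical names, and the specific rules are assumptions chosen for the six fields used in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Product:
    url: str
    manufacturer: str
    title: str
    sku: str
    price: float
    in_stock: bool


def validate(product: Product) -> List[str]:
    """Return the problems found in a scraped record; an empty list means it passed."""
    problems = []
    if not product.url.startswith(("http://", "https://")):
        problems.append("url is not an absolute HTTP(S) URL")
    # Text fields must contain something other than whitespace.
    for field_name in ("manufacturer", "title", "sku"):
        if not getattr(product, field_name).strip():
            problems.append(f"{field_name} is empty")
    if product.price <= 0:
        problems.append("price must be positive")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that is wrong with a page in a single pass.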

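Next Steps

Once the basics of data extraction are in place, further topics to explore include:

Saving data to local files or a database
Handling pagination and infinite scroll
Dealing with anti-scraping measures
Building a distributed crawling system

With Crawlee-Python, developers can build stable, maintainable crawlers for a wide range of data collection needs. The techniques covered here can be adapted to other site structures and form the foundation of web scraping work.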

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.