The Complete Guide to Web Scraping with Python: From Basics to Hands-On Projects

Table of Contents

  1. Introduction to Web Scraping

  2. Setting Up the Python Environment

  3. HTTP Fundamentals

  4. Web Page Parsing Techniques

  5. Data Storage Options

  6. Advanced Scraping Techniques

  7. Data Cleaning and Processing

  8. Hands-On Projects

  9. Ethical and Legal Considerations

  10. Summary and Resources

1. Introduction to Web Scraping

1.1 What Is Web Scraping

Web scraping is the process of extracting information from websites with automated programs. Compared with manual copy-and-paste, automated scraping can collect large amounts of structured data efficiently, providing source data for analytics, market research, and machine learning.

1.2 Why Use Python for Web Scraping

Python has become the language of choice for web scraping for the following reasons:

| Feature | Description |
| --- | --- |
| Rich library ecosystem | Requests, BeautifulSoup, Scrapy, Selenium, and more |
| Simple, readable syntax | Highly readable code and a gentle learning curve |
| Strong data-processing capabilities | Pandas, NumPy, and other libraries ease downstream processing |
| Cross-platform compatibility | Runs on Windows, macOS, and Linux |
| Community support | A huge developer community and abundant learning resources |

1.3 Legal and Ethical Considerations

Before scraping, you must understand the relevant legal and ethical guidelines:

| Consideration | Description |
| --- | --- |
| robots.txt | Follow the rules in the site's robots.txt file |
| Terms of service | Respect the website's terms of use |
| Request rate | Throttle requests sensibly to avoid burdening the site |
| Data usage | Be clear about how the data will be used; respect copyright and privacy |
| Crawler identity | Identify your crawler with an appropriate User-Agent |
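
As a small illustration of the last two rows (throttling and crawler identity), here is a minimal sketch of a polite request helper. The User-Agent name, contact URL, and delay value are placeholders to adapt, not a recommended standard; robots.txt handling is covered in detail in Section 9.

python

import time
import requests

# Identify the crawler honestly; the bot name and contact URL below are placeholders.
HEADERS = {'User-Agent': 'MyResearchBot/0.1 (+https://example.com/bot-info)'}

def polite_get(url, delay_seconds=2.0):
    """Fetch a URL with an identifying User-Agent and a fixed delay between requests."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(delay_seconds)  # throttle so consecutive calls do not hammer the server
    return response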

2. Setting Up the Python Environment

2.1 Installing and Configuring Python

Web scraping requires Python and a few supporting libraries. A recommended setup:

| Component | Version | Notes |
| --- | --- | --- |
| Python | 3.8+ | Use the latest stable release |
| pip | latest | Python package manager |
| virtualenv | latest | Creates isolated Python environments |

Installation steps:

  1. Download and install Python from the official Python website

  2. Verify the installation: run python --version in a terminal/CMD

  3. Upgrade pip: pip install --upgrade pip

  4. Install virtualenv: pip install virtualenv

2.2 Creating a Virtual Environment

Using a virtual environment avoids package conflicts:

bash

# Create a virtual environment
python -m venv scraping_env

# Activate the virtual environment (Windows)
scraping_env\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source scraping_env/bin/activate

2.3 Installing the Required Libraries

The core libraries used for web scraping are listed below; a quick verification snippet follows the table:

| Library | Purpose | Install command |
| --- | --- | --- |
| requests | Sending HTTP requests | pip install requests |
| BeautifulSoup4 | HTML parsing | pip install beautifulsoup4 |
| lxml | Fast XML/HTML parsing | pip install lxml |
| selenium | Browser automation | pip install selenium |
| scrapy | Crawling framework | pip install scrapy |
| pandas | Data processing and analysis | pip install pandas |
| numpy | Numerical computing | pip install numpy |
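
After installing, a quick way to confirm that everything is available in the active virtual environment is a small sanity-check script. This is only a convenience sketch; the package list simply mirrors the table above.

python

# Sanity check: confirm the core libraries are installed and report their versions.
from importlib.metadata import version, PackageNotFoundError

packages = ['requests', 'beautifulsoup4', 'lxml', 'selenium', 'scrapy', 'pandas', 'numpy']

for name in packages:
    try:
        print(f'{name}: {version(name)}')
    except PackageNotFoundError:
        print(f'{name}: NOT INSTALLED')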

2.4 Recommended Development Tools

| Tool type | Recommendation | Highlights |
| --- | --- | --- |
| IDE | PyCharm | Powerful Python-specific IDE |
| Text editor | VS Code | Lightweight, rich plugin ecosystem |
| Browser tooling | Chrome DevTools | Inspect page structure, debug scrapers |
| API testing | Postman | Test API endpoints |

3. HTTP Fundamentals

3.1 HTTP Requests and Responses

HTTP (Hypertext Transfer Protocol) is the foundation of web scraping, so understanding how it works is essential; a small inspection example follows the table:

| Component | Description |
| --- | --- |
| Request methods | GET, POST, PUT, DELETE, etc. |
| Status codes | 200 (OK), 404 (Not Found), 500 (Server Error), etc. |
| Request headers | User-Agent, Cookie, Referer, etc. |
| Response headers | Content-Type, Set-Cookie, etc. |
| Request body | Data sent in a POST request |
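
A quick way to see these pieces in practice is to inspect a response object from the Requests library. The snippet below uses httpbin.org only as a convenient echo service; any URL would do.

python

import requests

# Inspect the main parts of an HTTP exchange using the Requests library.
response = requests.get('https://httpbin.org/get', params={'q': 'demo'})

print(response.request.method)                 # request method (GET)
print(response.request.headers['User-Agent'])  # a request header sent by Requests
print(response.status_code)                    # status code (200 on success)
print(response.headers['Content-Type'])        # a response header
print(response.text[:200])                     # beginning of the response body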

3.2 Common HTTP Status Codes

| Status code | Meaning | Typical scenario |
| --- | --- | --- |
| 200 | OK | Request succeeded |
| 301 | Moved Permanently | Permanent redirect |
| 302 | Found | Temporary redirect |
| 400 | Bad Request | Malformed request |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Page does not exist |
| 500 | Internal Server Error | Server-side error |
| 503 | Service Unavailable | Service unavailable |
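
In Requests, redirects (301/302) are followed automatically and error codes (4xx/5xx) can be turned into exceptions. A short illustration; example.com is a placeholder, and whether any redirect hops appear depends on the target site.

python

import requests

response = requests.get('http://example.com')

# Any 301/302 hops that were followed automatically are recorded in response.history.
for hop in response.history:
    print(hop.status_code, hop.url)

print(response.status_code, response.url)  # final status code and URL

# raise_for_status() converts 4xx/5xx status codes into an HTTPError exception.
response.raise_for_status()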

3.3 Sending HTTP Requests with Requests

Requests is the most widely used HTTP library in Python and is simple to work with:

python

import requests

# Send a GET request
response = requests.get('https://api.example.com/data')

# Check whether the request succeeded
if response.status_code == 200:
    print('Request succeeded!')
    print(response.text)  # response body
else:
    print(f'Request failed, status code: {response.status_code}')

# Send a GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://api.example.com/data', params=params)

# Send a POST request
data = {'username': 'user', 'password': 'pass'}
response = requests.post('https://api.example.com/login', data=data)

# Set request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json'
}
response = requests.get('https://api.example.com/data', headers=headers)

3.4 Working with Cookies and Sessions

python

import requests

# Create a Session object to persist cookies across requests
session = requests.Session()

# Log in first to obtain cookies
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# Access an authenticated page using the stored cookies
response = session.get('https://example.com/dashboard')
print(response.text)

# Handle cookies manually
response = requests.get('https://example.com')
cookies = response.cookies

# Reuse the captured cookies in a follow-up request
response = requests.get('https://example.com/protected', cookies=cookies)

3.5 Handling Exceptions and Timeouts

python

import requests
from requests.exceptions import RequestException, Timeout, ConnectionError, HTTPError

try:
    # Set timeouts (connect timeout, read timeout)
    response = requests.get('https://example.com', timeout=(3.05, 27))

    # Raise an exception for 4xx/5xx status codes
    response.raise_for_status()

    print(response.text)

except Timeout:
    print('Request timed out')
except ConnectionError:
    print('Connection error')
except HTTPError as e:
    print(f'HTTP error: {e}')
except RequestException as e:
    # Catch-all for any other Requests error; it must come after the specific handlers
    print(f'Request error: {e}')

4. Web Page Parsing Techniques

4.1 Basic HTML Structure

Understanding HTML structure is a prerequisite for parsing web pages:

html

<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
</head>
<body>
    <div id="content">
        <h1 class="title">Main heading</h1>
        <p class="text">Paragraph text</p>
        <ul>
            <li>List item 1</li>
            <li>List item 2</li>
        </ul>
        <a href="https://example.com">Link</a>
    </div>
</body>
</html>

4.2 Parsing HTML with BeautifulSoup

BeautifulSoup is the most popular HTML parsing library in Python:

python

from bs4 import BeautifulSoup
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')  # or use 'html.parser'

# Find elements by tag name
title = soup.title  # the <title> tag
title_text = soup.title.text  # text inside the <title> tag

# Find elements with CSS selectors
first_paragraph = soup.select_one('p')  # first <p> tag
all_paragraphs = soup.select('p')  # all <p> tags

# Find elements by attribute
div_with_id = soup.find('div', id='content')  # the div with id="content"
elements_with_class = soup.find_all('div', class_='item')  # all divs with class="item"

# Extract attribute values
link = soup.find('a')
href = link['href']  # value of the href attribute

# Navigate the document tree
parent = link.parent  # parent element
children = div_with_id.children  # child elements
siblings = link.next_siblings  # following sibling elements

4.3 XPath with lxml

The lxml library provides XPath support and suits more complex parsing needs:

python

from lxml import html
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Build an HTML tree
tree = html.fromstring(html_content)

# Select elements with XPath
# All h1 tags
h1_elements = tree.xpath('//h1')

# Elements whose class is "title"
title_elements = tree.xpath('//*[@class="title"]')

# Elements containing specific text
specific_text = tree.xpath('//p[contains(text(), "specific text")]')

# Extract attributes
links = tree.xpath('//a/@href')  # href attribute of every link

# A more complex XPath example
# All p tags inside the div with id="content"
paragraphs = tree.xpath('//div[@id="content"]//p')

for p in paragraphs:
    print(p.text_content())  # text content of the element

4.4 Regular Expressions in Web Scraping

Regular expressions are well suited to extracting text that follows a specific pattern:

python

import re

text = "Phone: 123-456-7890, Email: example@email.com"

# Extract phone numbers
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['123-456-7890']

# Extract email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)  # ['example@email.com']

# Replace text
anonymized_text = re.sub(phone_pattern, 'XXX-XXX-XXXX', text)
print(anonymized_text)  # "Phone: XXX-XXX-XXXX, Email: example@email.com"

# Split text
sentence = "item1,item2;item3|item4"
split_result = re.split(r'[,;|]', sentence)
print(split_result)  # ['item1', 'item2', 'item3', 'item4']

4.5 Comparing Parsing Strategies

| Method | Strengths | Weaknesses | Best suited for |
| --- | --- | --- | --- |
| BeautifulSoup | Easy to use, tolerant of malformed HTML | Relatively slow | Simple pages, rapid development |
| lxml + XPath | Fast, very expressive | Steeper learning curve | Complex pages, high-performance needs |
| Regular expressions | Flexible, powerful pattern matching | Hard to read and maintain | Extracting text with a fixed pattern |
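
To make the comparison concrete, here is the same heading extracted with each of the three approaches; the HTML snippet is a made-up fragment used only for illustration.

python

import re
from bs4 import BeautifulSoup
from lxml import html as lxml_html

snippet = '<div id="content"><h1 class="title">Data Collection 101</h1></div>'

# BeautifulSoup: CSS selector
soup = BeautifulSoup(snippet, 'lxml')
print(soup.select_one('h1.title').text)

# lxml + XPath
tree = lxml_html.fromstring(snippet)
print(tree.xpath('//h1[@class="title"]/text()')[0])

# Regular expression (fragile, but fine for a fixed pattern)
print(re.search(r'<h1 class="title">(.*?)</h1>', snippet).group(1))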

5. Data Storage Options

5.1 File Storage

Storing data in CSV files

python

import csv
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = item.select_one('.price').text.strip()
    books.append([title, author, price])

# Write to a CSV file
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Author', 'Price'])  # header row
    writer.writerows(books)  # data rows

# Read the CSV file back
with open('books.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

Storing data in JSON files

python

import json
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        'price': item.select_one('.price').text.strip()
    }
    books.append(book)

# Write to a JSON file
with open('books.json', 'w', encoding='utf-8') as file:
    json.dump(books, file, ensure_ascii=False, indent=2)

# Read the JSON file back
with open('books.json', 'r', encoding='utf-8') as file:
    books_data = json.load(file)
    for book in books_data:
        print(book['title'], book['author'])

5.2 Database Storage

SQLite

python

import sqlite3
import requests
from bs4 import BeautifulSoup

# Open a database connection
conn = sqlite3.connect('books.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
CREATE TABLE IF NOT EXISTS books (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    author TEXT NOT NULL,
    price REAL NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
''')

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = float(item.select_one('.price').text.strip().replace('¥', ''))
    
    # Insert a row
    cursor.execute('INSERT INTO books (title, author, price) VALUES (?, ?, ?)', 
                  (title, author, price))

# Commit the transaction and close the connection
conn.commit()
conn.close()

MySQL

python

import mysql.connector
from mysql.connector import Error
import requests
from bs4 import BeautifulSoup

connection = None
try:
    # Open the database connection
    connection = mysql.connector.connect(
        host='localhost',
        database='web_scraping',
        user='username',
        password='password'
    )
    
    if connection.is_connected():
        cursor = connection.cursor()
        
        # Create the table
        cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) NOT NULL,
            author VARCHAR(255) NOT NULL,
            price DECIMAL(10, 2) NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
        ''')
        
        # Scrape the data
        response = requests.get('https://example.com/books')
        soup = BeautifulSoup(response.text, 'lxml')
        
        for item in soup.select('.book-item'):
            title = item.select_one('.title').text.strip()
            author = item.select_one('.author').text.strip()
            price = float(item.select_one('.price').text.strip().replace('¥', ''))
            
            # Insert a row
            cursor.execute('INSERT INTO books (title, author, price) VALUES (%s, %s, %s)', 
                          (title, author, price))
        
        connection.commit()
        
except Error as e:
    print(f"Database error: {e}")
finally:
    # Guard against the case where the connection was never established
    if connection is not None and connection.is_connected():
        cursor.close()
        connection.close()

5.3 NoSQL Databases

MongoDB

python

from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['web_scraping']
collection = db['books']

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        # Store the price as a number so range queries compare numerically
        'price': float(item.select_one('.price').text.strip().replace('¥', ''))
    }
    books.append(book)

# Insert the documents in bulk
if books:
    result = collection.insert_many(books)
    print(f"Inserted {len(result.inserted_ids)} documents")

# Query the data: books priced above 50
for book in collection.find({'price': {'$gt': 50}}):
    print(book)

# Close the connection
client.close()

5.4 Comparing Storage Options

| Storage | Strengths | Weaknesses | Best suited for |
| --- | --- | --- | --- |
| CSV files | Simple, universal, easy to inspect | Poor fit for complex data structures | Small projects, data exchange |
| JSON files | Preserves structure, readable | Inefficient for large files | Configuration, simple data structures |
| SQLite | Serverless, lightweight | Limited concurrency | Desktop apps, small projects |
| MySQL | Full-featured, good performance | Requires a separate server | Medium to large projects, web apps |
| MongoDB | Flexible schema, easy to scale | Higher memory usage | Unstructured data, fast iteration |
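
pandas can act as glue between several of these options: once scraped records are in a DataFrame, exporting to CSV, JSON, or SQLite is a one-liner each. A brief sketch; the records here are dummy data standing in for scraped results.

python

import sqlite3
import pandas as pd

# Dummy records standing in for scraped data
records = [
    {'title': 'Book A', 'author': 'Alice', 'price': 49.0},
    {'title': 'Book B', 'author': 'Bob', 'price': 120.5},
]
df = pd.DataFrame(records)

df.to_csv('books.csv', index=False)         # CSV file
df.to_json('books.json', orient='records')  # JSON file

conn = sqlite3.connect('books.db')
df.to_sql('books', conn, if_exists='append', index=False)  # SQLite table
conn.close()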

6. Advanced Scraping Techniques

6.1 Handling JavaScript-Rendered Pages

Many modern sites load content dynamically with JavaScript, which calls for browser automation tools:

Using Selenium

python

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # headless mode
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')

# Initialize the browser driver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the page
    driver.get('https://example.com/dynamic-content')
    
    # Wait until a specific element has loaded
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
    )
    
    # Interact with the page: click a button
    button = driver.find_element(By.ID, 'load-more')
    button.click()
    
    # Wait for the new content to load
    time.sleep(2)
    
    # Grab the rendered page source
    page_source = driver.page_source
    
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(page_source, 'lxml')
    
    # Extract the data
    items = soup.select('.item')
    for item in items:
        print(item.text)
        
finally:
    # Close the browser
    driver.quit()

Using Requests-HTML

python

from requests_html import HTMLSession

session = HTMLSession()

# Render the JavaScript
response = session.get('https://example.com/dynamic-content')
response.html.render(sleep=2, timeout=20)

# Extract the data
items = response.html.find('.item')
for item in items:
    print(item.text)

# Close the session
session.close()

6.2 Handling Pagination and Infinite Scroll

python

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

def scrape_paginated_content():
    driver = webdriver.Chrome()
    all_data = []
    
    try:
        driver.get('https://example.com/paginated-data')
        
        page_number = 1
        while True:
            print(f"Scraping page {page_number}...")
            
            # Wait for the content to load
            wait = WebDriverWait(driver, 10)
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item")))
            
            # Parse the current page
            soup = BeautifulSoup(driver.page_source, 'lxml')
            items = soup.select('.item')
            
            for item in items:
                # Extract the data and append it to all_data
                data = extract_item_data(item)
                all_data.append(data)
            
            # Check whether there is a next page
            next_button = driver.find_elements(By.CSS_SELECTOR, '.next-page')
            if not next_button or 'disabled' in next_button[0].get_attribute('class'):
                break
                
            # Click through to the next page
            next_button[0].click()
            page_number += 1
            time.sleep(2)  # wait for the page to load
            
    finally:
        driver.quit()
    
    return all_data

def extract_item_data(item):
    # Implement the item-specific extraction logic here
    title = item.select_one('.title').text.strip()
    price = item.select_one('.price').text.strip()
    return {'title': title, 'price': price}
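
The example above handles classic pagination by clicking a next-page button. For infinite-scroll pages there is no button to click; a common approach is to scroll to the bottom repeatedly until the page height stops growing. A minimal sketch, assuming a Selenium driver is already open and the page keeps appending content as you scroll:

python

import time

def scrape_infinite_scroll(driver, pause_seconds=2):
    """Scroll to the bottom until no new content loads, then return the page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    while True:
        # Scroll to the bottom of the page to trigger the next batch of content
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the new content time to load
        
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page stopped growing, so all content has loaded
        last_height = new_height
    
    return driver.page_source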

6.3 Using Proxies and Rotating the User-Agent

python

import requests
from fake_useragent import UserAgent
import random
import time

# Proxy pool
proxies = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080',
]

# Create the User-Agent generator
ua = UserAgent()

def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Pick a random proxy and User-Agent
            proxy = {'http': random.choice(proxies)}
            headers = {'User-Agent': ua.random}
            
            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            response.raise_for_status()
            return response
            
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff
            else:
                raise

# Usage example
try:
    response = get_with_retry('https://example.com')
    print("Request succeeded")
except Exception as e:
    print(f"All attempts failed: {e}")

6.4 Asynchronous Scraping

Use asyncio and aiohttp to improve scraping throughput:

python

import aiohttp
import asyncio
from bs4 import BeautifulSoup
import time

async def fetch_page(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def parse_page(content):
    if not content:
        return []
    
    soup = BeautifulSoup(content, 'lxml')
    items = soup.select('.item')
    data = []
    
    for item in items:
        title = item.select_one('.title').text.strip()
        price = item.select_one('.price').text.strip()
        data.append({'title': title, 'price': price})
    
    return data

async def scrape_urls(urls):
    connector = aiohttp.TCPConnector(limit=10)  # cap the number of concurrent connections
    timeout = aiohttp.ClientTimeout(total=30)
    
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = []
        for url in urls:
            task = asyncio.create_task(fetch_page(session, url))
            tasks.append(task)
        
        contents = await asyncio.gather(*tasks)
        
        parsing_tasks = []
        for content in contents:
            parsing_tasks.append(asyncio.create_task(parse_page(content)))
        
        results = await asyncio.gather(*parsing_tasks)
        
        # Merge all partial results
        all_data = []
        for result in results:
            all_data.extend(result)
        
        return all_data

# Usage example
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    # ... more URLs
]

start_time = time.time()
results = asyncio.run(scrape_urls(urls))
end_time = time.time()

print(f"Scraped {len(results)} records in {end_time - start_time:.2f} seconds")

7. Data Cleaning and Processing

7.1 Data Cleaning Techniques

Scraped data usually needs cleaning and preprocessing:

python

import pandas as pd
import numpy as np
import re
from datetime import datetime

# Sample data
data = [
    {'title': ' Intro to Python Programming ', 'price': '¥99.00', 'date': '2023-01-15'},
    {'title': 'Data Science in Practice', 'price': '150 yuan', 'date': '2023/02/20'},
    {'title': 'Machine Learning', 'price': '200', 'date': 'invalid date'},
    {'title': 'Web Development', 'price': '¥120.50', 'date': '2023-03-10'},
]

# Create a DataFrame
df = pd.DataFrame(data)

# Clean the titles: strip leading/trailing whitespace
df['title'] = df['title'].str.strip()

# Clean the prices: extract the numeric part
def clean_price(price):
    if isinstance(price, str):
        # Extract digits and the decimal point
        numbers = re.findall(r'\d+\.?\d*', price)
        if numbers:
            return float(numbers[0])
    return np.nan

df['price_clean'] = df['price'].apply(clean_price)

# Clean the dates
def clean_date(date_str):
    # Try several date formats
    for fmt in ('%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y', '%d/%m/%Y'):
        try:
            return datetime.strptime(date_str, fmt).date()
        except (ValueError, TypeError):
            continue
    return np.nan

df['date_clean'] = df['date'].apply(clean_date)

print("Original data:")
print(df[['title', 'price', 'date']])
print("\nCleaned data:")
print(df[['title', 'price_clean', 'date_clean']])

7.2 Data Transformation and Standardization

python

# Continuing with the DataFrame from above

# Inspect missing values
print("Missing value counts:")
print(df.isnull().sum())

# Fill missing values
df['price_clean'] = df['price_clean'].fillna(df['price_clean'].median())
df['date_clean'] = df['date_clean'].fillna(pd.Timestamp('today').date())

# Convert data types
df['price_clean'] = df['price_clean'].astype(float)

# Create new features
df['price_category'] = pd.cut(df['price_clean'], 
                             bins=[0, 100, 150, 200, np.inf],
                             labels=['cheap', 'moderate', 'pricey', 'expensive'])

# String operations
df['title_length'] = df['title'].str.len()
df['has_python'] = df['title'].str.contains('Python', case=False)

print("\nTransformed data:")
print(df)

7.3 Deduplication and Validation

python

# Deduplicate the data
print(f"Rows before deduplication: {len(df)}")

# Deduplicate by title
df_deduplicated = df.drop_duplicates(subset=['title'])
print(f"Rows after deduplication: {len(df_deduplicated)}")

# Validate the data
def validate_row(row):
    errors = []
    
    # Check that the price is plausible
    if row['price_clean'] <= 0 or row['price_clean'] > 1000:
        errors.append(f"Price {row['price_clean']} is implausible")
    
    # Check that the date is not in the future
    if pd.notna(row['date_clean']) and row['date_clean'] > pd.Timestamp('today').date():
        errors.append(f"Date {row['date_clean']} is in the future")
    
    return errors if errors else None

# Apply the validation
df['validation_errors'] = df.apply(validate_row, axis=1)

# Show the rows that failed validation
invalid_data = df[df['validation_errors'].notna()]
print(f"Found {len(invalid_data)} invalid rows")
for index, row in invalid_data.iterrows():
    print(f"Row {index}: {row['validation_errors']}")

8. Hands-On Projects

8.1 E-Commerce Price Monitor

python

import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
import re
import time
import schedule

class PriceMonitor:
    def __init__(self, url, target_price, email_settings):
        self.url = url
        self.target_price = target_price
        self.email_settings = email_settings
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    
    def get_current_price(self):
        try:
            response = requests.get(self.url, headers=self.headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'lxml')
            
            # Adjust the selector to match the target site's structure
            price_element = soup.select_one('.product-price, .price, [itemprop="price"]')
            if price_element:
                price_text = price_element.get_text().strip()
                # Extract the numeric part
                price = float(re.search(r'\d+\.?\d*', price_text).group())
                return price
            
            return None
            
        except Exception as e:
            print(f"Error fetching the price: {e}")
            return None
    
    def send_email_alert(self, current_price):
        msg = MIMEText(f"""
        Price alert!
        
        Product link: {self.url}
        Current price: ¥{current_price}
        Target price: ¥{self.target_price}
        
        The current price is at or below your target price!
        """)
        
        msg['Subject'] = 'Price alert: the product price has dropped!'
        msg['From'] = self.email_settings['from_email']
        msg['To'] = self.email_settings['to_email']
        
        try:
            with smtplib.SMTP(self.email_settings['smtp_server'], self.email_settings['smtp_port']) as server:
                server.starttls()
                server.login(self.email_settings['username'], self.email_settings['password'])
                server.send_message(msg)
            print("Alert email sent")
        except Exception as e:
            print(f"Error sending email: {e}")
    
    def check_price(self):
        print(f"Checking price... {time.strftime('%Y-%m-%d %H:%M:%S')}")
        current_price = self.get_current_price()
        
        if current_price is not None:
            print(f"Current price: ¥{current_price}")
            
            if current_price <= self.target_price:
                self.send_email_alert(current_price)
                return True
        
        return False
    
    def run_monitor(self, check_interval_hours=1):
        print(f"Starting the monitor; checking every {check_interval_hours} hour(s)")
        schedule.every(check_interval_hours).hours.do(self.check_price)
        
        # Check once immediately
        self.check_price()
        
        while True:
            schedule.run_pending()
            time.sleep(60)  # poll the schedule every minute

# Usage example
if __name__ == "__main__":
    # Configuration
    product_url = "https://example.com/product/123"
    target_price = 100.0
    email_settings = {
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'username': 'your_email@gmail.com',
        'password': 'your_password',
        'from_email': 'your_email@gmail.com',
        'to_email': 'recipient@example.com'
    }
    
    monitor = PriceMonitor(product_url, target_price, email_settings)
    monitor.run_monitor(check_interval_hours=2)

8.2 News Content Aggregator

python

import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
from datetime import datetime
import time

class NewsAggregator:
    def __init__(self):
        self.sources = {
            'source1': {
                'url': 'https://news-source1.com/latest',
                'article_selector': '.article',
                'title_selector': '.title',
                'summary_selector': '.summary',
                'date_selector': '.publish-date',
                'link_selector': 'a[href]'
            },
            'source2': {
                'url': 'https://news-source2.com/news',
                'article_selector': '.news-item',
                'title_selector': 'h2',
                'summary_selector': '.description',
                'date_selector': '.time',
                'link_selector': 'a'
            }
            # Add more news sources here
        }
        
        self.articles = []
    
    def scrape_source(self, source_name, source_config):
        try:
            print(f"Scraping {source_name}...")
            response = requests.get(source_config['url'], timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'lxml')
            article_elements = soup.select(source_config['article_selector'])
            
            for article in article_elements:
                try:
                    title_elem = article.select_one(source_config['title_selector'])
                    summary_elem = article.select_one(source_config['summary_selector'])
                    date_elem = article.select_one(source_config['date_selector'])
                    link_elem = article.select_one(source_config['link_selector'])
                    
                    if title_elem and link_elem:
                        article_data = {
                            'source': source_name,
                            'title': title_elem.get_text().strip(),
                            'summary': summary_elem.get_text().strip() if summary_elem else '',
                            'date': date_elem.get_text().strip() if date_elem else '',
                            'link': link_elem['href'] if link_elem and 'href' in link_elem.attrs else '',
                            'scraped_at': datetime.now().isoformat()
                        }
                        
                        # Make sure the link is an absolute URL
                        if article_data['link'] and not article_data['link'].startswith('http'):
                            article_data['link'] = source_config['url'] + article_data['link']
                        
                        self.articles.append(article_data)
                        
                except Exception as e:
                    print(f"Error processing an article: {e}")
                    continue
                    
        except Exception as e:
            print(f"Error scraping {source_name}: {e}")
    
    def scrape_all_sources(self):
        print("Scraping all news sources...")
        self.articles = []
        
        for source_name, source_config in self.sources.items():
            self.scrape_source(source_name, source_config)
            time.sleep(1)  # polite delay between sources
        
        print(f"Done; collected {len(self.articles)} articles")
    
    def save_to_json(self, filename):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.articles, f, ensure_ascii=False, indent=2)
        print(f"Data saved to {filename}")
    
    def save_to_csv(self, filename):
        df = pd.DataFrame(self.articles)
        df.to_csv(filename, index=False, encoding='utf-8')
        print(f"Data saved to {filename}")
    
    def analyze_articles(self):
        df = pd.DataFrame(self.articles)
        
        if df.empty:
            print("No data to analyze")
            return
        
        print("\n=== Analysis ===")
        print(f"Total articles: {len(df)}")
        print("\nArticles per source:")
        print(df['source'].value_counts())
        
        # Date analysis (if date information is available)
        if 'date' in df.columns and not df['date'].empty:
            # Add date parsing and analysis logic here
            pass
        
        return df

# Usage example
if __name__ == "__main__":
    aggregator = NewsAggregator()
    aggregator.scrape_all_sources()
    
    if aggregator.articles:
        aggregator.save_to_json('news_articles.json')
        aggregator.save_to_csv('news_articles.csv')
        
        df = aggregator.analyze_articles()
        print("\nFirst 5 articles:")
        print(df[['source', 'title', 'date']].head())

9. Ethical and Legal Considerations

9.1 Key Principles of Lawful Scraping

| Principle | Description | Practical advice |
| --- | --- | --- |
| Respect robots.txt | Follow the site's crawling policy | Check and obey the target site's robots.txt |
| Throttle requests | Avoid burdening the site | Add delays and limit concurrent requests |
| Identify the crawler | Be honest about who is crawling | Use an appropriate User-Agent |
| Respect copyright | Do not infringe on content rights | Collect only what you need and credit the source |
| Protect privacy | Do not harvest personal data | Avoid collecting emails, phone numbers, and other sensitive information |

9.2 Respecting robots.txt: An Example

python

import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
import time

def check_robots_permission(url, user_agent='*'):
    """Check whether the given URL may be crawled."""
    try:
        # Parse the URL to obtain the base URL
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        
        # Fetch robots.txt
        robots_url = f"{base_url}/robots.txt"
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        
        # Check the permission
        return rp.can_fetch(user_agent, url)
    
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        return False

def respectful_crawler(url, user_agent='MyCrawler/1.0'):
    """A polite crawler that honors robots.txt."""
    if not check_robots_permission(url, user_agent):
        print(f"robots.txt disallows crawling: {url}")
        return None
    
    try:
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Polite delay
        time.sleep(1)
        
        return response.text
    
    except Exception as e:
        print(f"Error crawling {url}: {e}")
        return None

# Usage example
url = 'https://example.com/some-page'
content = respectful_crawler(url)
if content:
    print("Content fetched successfully")
    # Process the content...

10. Summary and Resources

10.1 Best Practices at a Glance

| Area | Best practice |
| --- | --- |
| Code structure | Modular design; each function has a single responsibility |
| Error handling | Catch and handle exceptions comprehensively |
| Performance | Asynchronous requests, sensible caching |
| Maintainability | Clear comments, configuration kept separate from code |
| Compliance | Respect robots.txt, throttle request rates |
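
To tie these practices together, here is a minimal sketch of how a scraper might be organized: configuration separated from logic, a single fetch function with error handling, an identifying User-Agent, and throttling, plus a small parse function. The URL and selectors are placeholders, not a working target.

python

import time
import requests
from bs4 import BeautifulSoup

# Configuration kept in one place (it could equally live in a JSON/YAML file)
CONFIG = {
    'start_url': 'https://example.com/items',  # placeholder URL
    'item_selector': '.item',                 # placeholder selectors
    'title_selector': '.title',
    'delay_seconds': 1.0,
    'user_agent': 'MyCrawler/1.0',
}

def fetch(url, config):
    """Fetch a page with error handling, an identifying User-Agent, and a delay."""
    try:
        response = requests.get(url, headers={'User-Agent': config['user_agent']}, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Fetch failed for {url}: {e}")
        return None
    finally:
        time.sleep(config['delay_seconds'])  # throttle regardless of outcome

def parse(html, config):
    """Extract one record per item element."""
    soup = BeautifulSoup(html, 'lxml')
    return [
        {'title': item.select_one(config['title_selector']).text.strip()}
        for item in soup.select(config['item_selector'])
        if item.select_one(config['title_selector'])
    ]

if __name__ == '__main__':
    html = fetch(CONFIG['start_url'], CONFIG)
    if html:
        for record in parse(html, CONFIG):
            print(record)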

10.2 Recommended Learning Resources

| Resource type | Recommendations |
| --- | --- |
| Official documentation | Requests, BeautifulSoup, Scrapy, Selenium |
| Online courses | Web scraping courses on Coursera, Udemy, and iMooc (慕课网) |
| Books | 《Python网络数据采集》, 《用Python写网络爬虫》 |
| Communities | Stack Overflow, GitHub, related topics on Zhihu |
| Tools | Postman, Chrome DevTools, Scrapy Cloud |

10.3 Common Problems and Solutions

| Problem | Solution |
| --- | --- |
| IP gets blocked | Use a proxy pool, reduce the request rate |
| Dynamic content | Use Selenium or Requests-HTML |
| CAPTCHAs | Use a CAPTCHA-solving service or handle them manually |
| Login required | Keep cookies in a session, handle tokens |
| Anti-bot measures | Randomize the User-Agent, mimic human behavior |

With this guide, you should now have the full range of Python web scraping skills, from the basics to advanced techniques. Remember that the technology is only a tool; how you use it is what matters. Always follow ethical and legal guidelines and scrape responsibly.

Happy Scraping!
