Table of Contents

- [Data Collection Overview](#数据采集概述)
- [Python Environment Setup](#Python-环境配置)
- [HTTP Protocol Basics](#HTTP-协议基础)
- [Web Page Parsing Techniques](#网页解析技术)
- [Data Storage Options](#数据存储方案)
- [Advanced Collection Techniques](#高级采集技术)
- [Data Cleaning and Processing](#数据清洗与处理)
- [Hands-On Projects](#实战项目)
- [Ethical and Legal Considerations](#道德与法律考量)
- [Summary and Resources](#总结与资源)
1. Data Collection Overview <a name="数据采集概述"></a>

1.1 What Is Web Scraping

Web scraping is the process of extracting information from websites with automated programs. Compared with manual copy-and-paste, automated scraping can gather large volumes of structured data efficiently, providing source data for analytics, market research, and machine learning.
1.2 Why Use Python for Web Scraping

Python has become the language of choice for web scraping for the following reasons:

| Feature | Description |
|---|---|
| Rich library ecosystem | Requests, BeautifulSoup, Scrapy, Selenium, and more |
| Simple, readable syntax | Highly readable code and a gentle learning curve |
| Strong data-processing support | Pandas, NumPy, and related libraries ease downstream processing |
| Cross-platform | Runs on Windows, macOS, and Linux |
| Community support | Large developer community and abundant learning resources |
1.3 Legal and Ethical Considerations

Before scraping, you must understand the relevant legal and ethical guidelines:

| Consideration | Description |
|---|---|
| robots.txt | Follow the rules in the site's robots.txt file |
| Terms of service | Respect the website's terms of use |
| Request rate | Throttle requests sensibly to avoid overloading the site |
| Data usage | Be clear about how the data will be used; respect copyright and privacy |
| Identification | Use an appropriate User-Agent to identify your crawler |
2. Python Environment Setup <a name="Python-环境配置"></a>

2.1 Installing and Configuring Python

Web scraping requires Python and a handful of libraries. The recommended setup:

| Component | Version | Notes |
|---|---|---|
| Python | 3.8+ | Use the latest stable release |
| pip | latest | Python package manager |
| virtualenv | latest | Creates isolated Python environments |
Installation steps:

1. Download and install Python from the official Python website
2. Verify the installation by running `python --version` in a terminal/CMD
3. Upgrade pip: `pip install --upgrade pip`
4. Install virtualenv: `pip install virtualenv`
2.2 Creating a Virtual Environment

A virtual environment helps you avoid package conflicts:

```bash
# Create a virtual environment
python -m venv scraping_env

# Activate the virtual environment (Windows)
scraping_env\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source scraping_env/bin/activate
```
2.3 Installing the Required Libraries

The core libraries for web scraping:

| Library | Purpose | Install command |
|---|---|---|
| requests | Sending HTTP requests | pip install requests |
| beautifulsoup4 | HTML parsing | pip install beautifulsoup4 |
| lxml | Fast XML/HTML parsing | pip install lxml |
| selenium | Browser automation | pip install selenium |
| scrapy | Crawling framework | pip install scrapy |
| pandas | Data processing and analysis | pip install pandas |
| numpy | Numerical computing | pip install numpy |
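All of these can be installed in one go inside the activated virtual environment. The sketch below also freezes the installed versions into a requirements.txt so the setup can be reproduced later (the pinning step and file name are just a common convention, not a requirement):

```bash
# Install all core scraping libraries in the active virtual environment
pip install requests beautifulsoup4 lxml selenium scrapy pandas numpy

# Record the exact versions for reproducible setups
pip freeze > requirements.txt

# Recreate the same environment elsewhere
pip install -r requirements.txt
```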
2.4 Recommended Development Tools

| Tool type | Recommendation | Highlights |
|---|---|---|
| IDE | PyCharm | Full-featured Python IDE |
| Text editor | VS Code | Lightweight, rich plugin ecosystem |
| Browser tools | Chrome DevTools | Inspect page structure and debug crawlers |
| API testing | Postman | Test API endpoints |
3. HTTP Protocol Basics <a name="HTTP-协议基础"></a>

3.1 HTTP Requests and Responses

HTTP (HyperText Transfer Protocol) is the foundation of web scraping, so understanding how it works is essential:

| Component | Description |
|---|---|
| Request methods | GET, POST, PUT, DELETE, etc. |
| Status codes | 200 (OK), 404 (Not Found), 500 (Server Error), etc. |
| Request headers | User-Agent, Cookie, Referer, etc. |
| Response headers | Content-Type, Set-Cookie, etc. |
| Request body | Data sent with POST requests |
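To see these components in practice, the short sketch below sends one request and prints the pieces listed in the table (httpbin.org is a public HTTP testing service, used here purely for illustration):

```python
import requests

response = requests.get('https://httpbin.org/get')

print(response.request.method)                      # Request method: GET
print(response.request.headers.get('User-Agent'))   # A request header that was actually sent
print(response.status_code)                         # Status code, e.g. 200
print(response.headers.get('Content-Type'))         # A response header
```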
3.2 Common HTTP Status Codes

| Status code | Meaning | Typical scenario |
|---|---|---|
| 200 | OK | Request succeeded |
| 301 | Moved Permanently | Permanent redirect |
| 302 | Found | Temporary redirect |
| 400 | Bad Request | Malformed request |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Page does not exist |
| 500 | Internal Server Error | Server-side failure |
| 503 | Service Unavailable | Service temporarily unavailable |
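As a quick preview of the Requests library covered in the next section, the sketch below shows how the redirect codes (301/302) surface in practice; the URL is a placeholder. Requests follows redirects by default and keeps the intermediate responses in response.history:

```python
import requests

# Redirects are followed automatically; the 301/302 hops end up in response.history
response = requests.get('https://example.com/old-page', timeout=10)
for hop in response.history:
    print(f'Redirected with {hop.status_code} to {hop.headers.get("Location")}')
print(f'Final status: {response.status_code}')

# Disable automatic redirects to inspect the raw redirect response yourself
raw = requests.get('https://example.com/old-page', allow_redirects=False, timeout=10)
print(raw.status_code, raw.headers.get('Location'))
```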
3.3 Sending HTTP Requests with the Requests Library

Requests is the most widely used HTTP library in Python and is simple to work with:

```python
import requests

# Send a GET request
response = requests.get('https://api.example.com/data')

# Check whether the request succeeded
if response.status_code == 200:
    print('Request succeeded!')
    print(response.text)  # Response body
else:
    print(f'Request failed with status code: {response.status_code}')

# Send a GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://api.example.com/data', params=params)

# Send a POST request
data = {'username': 'user', 'password': 'pass'}
response = requests.post('https://api.example.com/login', data=data)

# Set request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json'
}
response = requests.get('https://api.example.com/data', headers=headers)
```
3.4 Handling Cookies and Sessions

```python
import requests

# Create a Session object to persist cookies across requests
session = requests.Session()

# Log in first to obtain cookies
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# Reuse the session cookies to access a page that requires authentication
response = session.get('https://example.com/dashboard')
print(response.text)

# Handle cookies manually
response = requests.get('https://example.com')
cookies = response.cookies

# Send a follow-up request with the captured cookies
response = requests.get('https://example.com/protected', cookies=cookies)
```
3.5 Handling Exceptions and Timeouts

```python
import requests
from requests.exceptions import RequestException, Timeout, ConnectionError, HTTPError

try:
    # Set timeouts (connect timeout, read timeout)
    response = requests.get('https://example.com', timeout=(3.05, 27))
    # Raise an exception for HTTP error status codes
    response.raise_for_status()
    print(response.text)
except Timeout:
    print('Request timed out')
except ConnectionError:
    print('Connection error')
except HTTPError as e:
    print(f'HTTP error: {e}')
except RequestException as e:
    # Catch-all for any other Requests exception; must come after the specific ones
    print(f'Request error: {e}')
```
4. Web Page Parsing Techniques <a name="网页解析技术"></a>

4.1 Basic HTML Structure

Understanding HTML structure is a prerequisite for parsing web pages:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
</head>
<body>
    <div id="content">
        <h1 class="title">Main heading</h1>
        <p class="text">Paragraph text</p>
        <ul>
            <li>List item 1</li>
            <li>List item 2</li>
        </ul>
        <a href="https://example.com">Link</a>
    </div>
</body>
</html>
```
4.2 Parsing HTML with BeautifulSoup

BeautifulSoup is the most popular HTML parsing library in Python:

```python
from bs4 import BeautifulSoup
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')  # or use 'html.parser'

# Find elements by tag name
title = soup.title            # The <title> tag
title_text = soup.title.text  # Text inside the <title> tag

# Find elements with CSS selectors
first_paragraph = soup.select_one('p')  # First <p> tag
all_paragraphs = soup.select('p')       # All <p> tags

# Find elements by attribute
div_with_id = soup.find('div', id='content')               # The div with id="content"
elements_with_class = soup.find_all('div', class_='item')  # All divs with class="item"

# Extract attribute values
link = soup.find('a')
href = link['href']  # The href attribute

# Navigate the document tree
parent = link.parent             # Parent element
children = div_with_id.children  # Child elements
siblings = link.next_siblings    # Following sibling elements
```
4.3 XPath and lxml

The lxml library provides XPath support and suits more complex parsing needs:

```python
from lxml import html
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Build the HTML tree
tree = html.fromstring(html_content)

# Select elements with XPath
# All h1 tags
h1_elements = tree.xpath('//h1')

# Elements whose class attribute is "title"
title_elements = tree.xpath('//*[@class="title"]')

# Elements containing specific text
specific_text = tree.xpath('//p[contains(text(), "specific text")]')

# Extract attributes: the href of every link
links = tree.xpath('//a/@href')

# A more complex XPath example:
# all p tags under the div with id="content"
paragraphs = tree.xpath('//div[@id="content"]//p')
for p in paragraphs:
    print(p.text_content())  # Text content of the element
```
4.4 Regular Expressions in Web Scraping

Regular expressions are well suited to extracting text that follows a specific pattern:

```python
import re

text = "Phone: 123-456-7890, Email: example@email.com"

# Extract phone numbers
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['123-456-7890']

# Extract email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)  # ['example@email.com']

# Replace text
anonymized_text = re.sub(phone_pattern, 'XXX-XXX-XXXX', text)
print(anonymized_text)  # "Phone: XXX-XXX-XXXX, Email: example@email.com"

# Split text
sentence = "data1,data2;data3|data4"
split_result = re.split(r'[,;|]', sentence)
print(split_result)  # ['data1', 'data2', 'data3', 'data4']
```
4.5 Comparing Parsing Strategies

| Method | Pros | Cons | Best for |
|---|---|---|---|
| BeautifulSoup | Easy to use, tolerant of messy HTML | Relatively slow | Simple pages, rapid development |
| lxml + XPath | Fast, expressive queries | Steeper learning curve | Complex pages, performance-critical jobs |
| Regular expressions | Flexible, powerful pattern matching | Hard to read and maintain | Extracting text with a fixed pattern |
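To make the trade-offs concrete, here is a minimal sketch (the HTML fragment is invented for illustration) that pulls the same heading out of a page with each of the three approaches:

```python
import re
from bs4 import BeautifulSoup
from lxml import html

# A small, made-up HTML fragment used only for this comparison
html_content = '<div id="content"><h1 class="title">Python Data Collection</h1></div>'

# 1. BeautifulSoup: concise and forgiving of messy markup
soup = BeautifulSoup(html_content, 'lxml')
print(soup.select_one('h1.title').text)

# 2. lxml + XPath: fastest of the three, more expressive selectors
tree = html.fromstring(html_content)
print(tree.xpath('//h1[@class="title"]/text()')[0])

# 3. Regular expression: works, but breaks easily if the markup changes
match = re.search(r'<h1 class="title">(.*?)</h1>', html_content)
print(match.group(1))
```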
5. Data Storage Options <a name="数据存储方案"></a>

5.1 File Storage

CSV files

```python
import csv
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = item.select_one('.price').text.strip()
    books.append([title, author, price])

# Write to a CSV file
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Author', 'Price'])  # Header row
    writer.writerows(books)                        # Data rows

# Read the CSV file back
with open('books.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```
JSON files

```python
import json
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        'price': item.select_one('.price').text.strip()
    }
    books.append(book)

# Write to a JSON file
with open('books.json', 'w', encoding='utf-8') as file:
    json.dump(books, file, ensure_ascii=False, indent=2)

# Read the JSON file back
with open('books.json', 'r', encoding='utf-8') as file:
    books_data = json.load(file)
    for book in books_data:
        print(book['title'], book['author'])
```
5.2 Database Storage

SQLite

```python
import sqlite3
import requests
from bs4 import BeautifulSoup

# Open a database connection
conn = sqlite3.connect('books.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS books (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        author TEXT NOT NULL,
        price REAL NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = float(item.select_one('.price').text.strip().replace('¥', ''))

    # Insert a row
    cursor.execute('INSERT INTO books (title, author, price) VALUES (?, ?, ?)',
                   (title, author, price))

# Commit the transaction and close the connection
conn.commit()
conn.close()
```
MySQL

```python
import mysql.connector
from mysql.connector import Error
import requests
from bs4 import BeautifulSoup

try:
    # Open a database connection
    connection = mysql.connector.connect(
        host='localhost',
        database='web_scraping',
        user='username',
        password='password'
    )

    if connection.is_connected():
        cursor = connection.cursor()

        # Create the table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS books (
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255) NOT NULL,
                author VARCHAR(255) NOT NULL,
                price DECIMAL(10, 2) NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')

        # Scrape the data
        response = requests.get('https://example.com/books')
        soup = BeautifulSoup(response.text, 'lxml')

        for item in soup.select('.book-item'):
            title = item.select_one('.title').text.strip()
            author = item.select_one('.author').text.strip()
            price = float(item.select_one('.price').text.strip().replace('¥', ''))

            # Insert a row
            cursor.execute('INSERT INTO books (title, author, price) VALUES (%s, %s, %s)',
                           (title, author, price))

        connection.commit()

except Error as e:
    print(f"Database error: {e}")
finally:
    # Guard against the case where the connection was never established
    if 'connection' in locals() and connection.is_connected():
        cursor.close()
        connection.close()
```
5.3 NoSQL Databases

MongoDB

```python
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['web_scraping']
collection = db['books']

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        'price': item.select_one('.price').text.strip()
    }
    books.append(book)

# Insert the documents in bulk
if books:
    result = collection.insert_many(books)
    print(f"Inserted {len(result.inserted_ids)} documents")

# Query the data
# Note: price is stored as a string here, so $gt compares strings;
# store a numeric value if you need real range queries
for book in collection.find({'price': {'$gt': '¥50'}}):
    print(book)

# Close the connection
client.close()
```
5.4 Comparing Storage Options

| Storage | Pros | Cons | Best for |
|---|---|---|---|
| CSV files | Simple, universal, easy to inspect | Poor fit for complex structures | Small projects, data exchange |
| JSON files | Preserves structure, readable | Inefficient for large files | Config data, simple structures |
| SQLite | Serverless, lightweight | Limited concurrency | Desktop apps, small projects |
| MySQL | Feature-rich, good performance | Needs a separate server | Medium to large projects, web apps |
| MongoDB | Flexible schema, easy to scale | Higher memory usage | Unstructured data, fast iteration |
6. Advanced Collection Techniques <a name="高级采集技术"></a>

6.1 Handling JavaScript-Rendered Pages

Many modern sites load content dynamically with JavaScript, which calls for browser automation tools:

Using Selenium

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # Headless mode
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')

# Initialize the browser driver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the page
    driver.get('https://example.com/dynamic-content')

    # Wait until a specific element has loaded
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
    )

    # Interact with the page: click a button
    button = driver.find_element(By.ID, 'load-more')
    button.click()

    # Wait for the new content to load
    time.sleep(2)

    # Grab the rendered page source
    page_source = driver.page_source

    # Parse it with BeautifulSoup
    soup = BeautifulSoup(page_source, 'lxml')

    # Extract the data
    items = soup.select('.item')
    for item in items:
        print(item.text)

finally:
    # Close the browser
    driver.quit()
```
Using Requests-HTML

```python
from requests_html import HTMLSession

session = HTMLSession()

# Render the JavaScript
response = session.get('https://example.com/dynamic-content')
response.html.render(sleep=2, timeout=20)

# Extract the data
items = response.html.find('.item')
for item in items:
    print(item.text)

# Close the session
session.close()
```
6.2 Handling Pagination and Infinite Scroll

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

def scrape_paginated_content():
    driver = webdriver.Chrome()
    all_data = []

    try:
        driver.get('https://example.com/paginated-data')
        page_number = 1

        while True:
            print(f"Scraping page {page_number}...")

            # Wait for the content to load
            wait = WebDriverWait(driver, 10)
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item")))

            # Parse the current page
            soup = BeautifulSoup(driver.page_source, 'lxml')
            items = soup.select('.item')

            for item in items:
                # Extract the data and collect it in all_data
                data = extract_item_data(item)
                all_data.append(data)

            # Check whether there is a next page
            next_button = driver.find_elements(By.CSS_SELECTOR, '.next-page')
            if not next_button or 'disabled' in next_button[0].get_attribute('class'):
                break

            # Click through to the next page
            next_button[0].click()
            page_number += 1
            time.sleep(2)  # Wait for the page to load

    finally:
        driver.quit()

    return all_data

def extract_item_data(item):
    # Implement the item-specific extraction logic here
    title = item.select_one('.title').text.strip()
    price = item.select_one('.price').text.strip()
    return {'title': title, 'price': price}
```
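The example above handles explicit pagination. For infinite-scroll pages, a common approach (sketched below with a hypothetical URL and a cap on scroll rounds) is to keep scrolling to the bottom until the page height stops growing, then parse the accumulated page source as before:

```python
from selenium import webdriver
import time

def scrape_infinite_scroll(url, pause=2, max_rounds=20):
    """Scroll an infinite-scroll page until no new content appears, then return the page source."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")

        for _ in range(max_rounds):
            # Scroll to the bottom to trigger loading of the next batch of content
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # Give the new content time to load

            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # Page height stopped growing: no more content
            last_height = new_height

        return driver.page_source
    finally:
        driver.quit()

# Usage: parse the returned HTML with BeautifulSoup as in the previous examples
# html = scrape_infinite_scroll('https://example.com/infinite-feed')
```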
6.3 Using Proxies and Rotating User-Agents

```python
import requests
from fake_useragent import UserAgent
import random
import time

# Proxy pool
proxies = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080',
]

# Create the User-Agent generator
ua = UserAgent()

def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Pick a random proxy and User-Agent
            proxy = {'http': random.choice(proxies)}
            headers = {'User-Agent': ua.random}

            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            response.raise_for_status()
            return response

        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

# Usage example
try:
    response = get_with_retry('https://example.com')
    print("Request succeeded")
except Exception as e:
    print(f"All attempts failed: {e}")
```
6.4 Asynchronous Scraping

Use asyncio and aiohttp to speed up collection:

```python
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import time

async def fetch_page(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def parse_page(content):
    if not content:
        return []

    soup = BeautifulSoup(content, 'lxml')
    items = soup.select('.item')

    data = []
    for item in items:
        title = item.select_one('.title').text.strip()
        price = item.select_one('.price').text.strip()
        data.append({'title': title, 'price': price})

    return data

async def scrape_urls(urls):
    connector = aiohttp.TCPConnector(limit=10)  # Cap the number of concurrent connections
    timeout = aiohttp.ClientTimeout(total=30)

    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = []
        for url in urls:
            task = asyncio.create_task(fetch_page(session, url))
            tasks.append(task)

        contents = await asyncio.gather(*tasks)

        parsing_tasks = []
        for content in contents:
            parsing_tasks.append(asyncio.create_task(parse_page(content)))

        results = await asyncio.gather(*parsing_tasks)

        # Merge all the results
        all_data = []
        for result in results:
            all_data.extend(result)

        return all_data

# Usage example
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    # ... more URLs
]

start_time = time.time()
results = asyncio.run(scrape_urls(urls))
end_time = time.time()

print(f"Collected {len(results)} items in {end_time - start_time:.2f} seconds")
```
7. Data Cleaning and Processing <a name="数据清洗与处理"></a>

7.1 Data Cleaning Techniques

Scraped data usually needs cleaning and preprocessing:

```python
import pandas as pd
import numpy as np
import re
from datetime import datetime

# Sample data
data = [
    {'title': ' Python Programming Basics ', 'price': '¥99.00', 'date': '2023-01-15'},
    {'title': 'Practical Data Science', 'price': '150 yuan', 'date': '2023/02/20'},
    {'title': 'Machine Learning', 'price': '200', 'date': 'invalid date'},
    {'title': 'Web Development', 'price': '¥120.50', 'date': '2023-03-10'},
]

# Build a DataFrame
df = pd.DataFrame(data)

# Clean the titles: strip leading/trailing whitespace
df['title'] = df['title'].str.strip()

# Clean the prices: extract the numeric part
def clean_price(price):
    if isinstance(price, str):
        # Pull out digits and a decimal point
        numbers = re.findall(r'\d+\.?\d*', price)
        if numbers:
            return float(numbers[0])
    return np.nan

df['price_clean'] = df['price'].apply(clean_price)

# Clean the dates
def clean_date(date_str):
    try:
        # Try several date formats
        for fmt in ('%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y', '%d/%m/%Y'):
            try:
                return datetime.strptime(date_str, fmt).date()
            except ValueError:
                continue
        return np.nan
    except (TypeError, ValueError):
        return np.nan

df['date_clean'] = df['date'].apply(clean_date)

print("Original data:")
print(df[['title', 'price', 'date']])
print("\nCleaned data:")
print(df[['title', 'price_clean', 'date_clean']])
```
7.2 Data Transformation and Standardization

```python
# Continuing with the DataFrame from above

# Check for missing values
print("Missing value counts:")
print(df.isnull().sum())

# Fill missing values
df['price_clean'] = df['price_clean'].fillna(df['price_clean'].median())
df['date_clean'] = df['date_clean'].fillna(pd.Timestamp('today').date())

# Convert data types
df['price_clean'] = df['price_clean'].astype(float)

# Create new features
df['price_category'] = pd.cut(df['price_clean'],
                              bins=[0, 100, 150, 200, np.inf],
                              labels=['cheap', 'moderate', 'pricey', 'expensive'])

# String operations
df['title_length'] = df['title'].str.len()
df['has_python'] = df['title'].str.contains('Python', case=False)

print("\nTransformed data:")
print(df)
```
7.3 Deduplication and Validation

```python
# Deduplicate the data
print(f"Rows before deduplication: {len(df)}")

# Drop duplicates based on the title
df_deduplicated = df.drop_duplicates(subset=['title'])
print(f"Rows after deduplication: {len(df_deduplicated)}")

# Validate the data
def validate_row(row):
    errors = []

    # Check that the price is plausible
    if row['price_clean'] <= 0 or row['price_clean'] > 1000:
        errors.append(f"Price {row['price_clean']} looks unreasonable")

    # Check that the date is not in the future
    if pd.notna(row['date_clean']) and row['date_clean'] > pd.Timestamp('today').date():
        errors.append(f"Date {row['date_clean']} is in the future")

    return errors if errors else None

# Apply the validation
df['validation_errors'] = df.apply(validate_row, axis=1)

# Show the invalid rows
invalid_data = df[df['validation_errors'].notna()]
print(f"Found {len(invalid_data)} invalid rows")
for index, row in invalid_data.iterrows():
    print(f"Row {index}: {row['validation_errors']}")
```
8. Hands-On Projects <a name="实战项目"></a>

8.1 E-commerce Price Monitor

```python
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
import re
import time
import schedule

class PriceMonitor:
    def __init__(self, url, target_price, email_settings):
        self.url = url
        self.target_price = target_price
        self.email_settings = email_settings
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_current_price(self):
        try:
            response = requests.get(self.url, headers=self.headers, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')

            # Adjust the selector to match the target site's structure
            price_element = soup.select_one('.product-price, .price, [itemprop="price"]')

            if price_element:
                price_text = price_element.get_text().strip()
                # Extract the numeric part
                price = float(re.search(r'\d+\.?\d*', price_text).group())
                return price
            return None

        except Exception as e:
            print(f"Error while fetching the price: {e}")
            return None

    def send_email_alert(self, current_price):
        msg = MIMEText(f"""
        Price alert!

        Product link: {self.url}
        Current price: ¥{current_price}
        Target price: ¥{self.target_price}

        The current price has dropped to or below your target price!
        """)

        msg['Subject'] = 'Price alert: the product price has dropped!'
        msg['From'] = self.email_settings['from_email']
        msg['To'] = self.email_settings['to_email']

        try:
            with smtplib.SMTP(self.email_settings['smtp_server'],
                              self.email_settings['smtp_port']) as server:
                server.starttls()
                server.login(self.email_settings['username'],
                             self.email_settings['password'])
                server.send_message(msg)
            print("Alert email sent")
        except Exception as e:
            print(f"Error while sending the email: {e}")

    def check_price(self):
        print(f"Checking price... {time.strftime('%Y-%m-%d %H:%M:%S')}")
        current_price = self.get_current_price()

        if current_price is not None:
            print(f"Current price: ¥{current_price}")
            if current_price <= self.target_price:
                self.send_email_alert(current_price)
                return True
        return False

    def run_monitor(self, check_interval_hours=1):
        print(f"Starting the monitor; checking every {check_interval_hours} hour(s)")
        schedule.every(check_interval_hours).hours.do(self.check_price)

        # Run one check immediately
        self.check_price()

        while True:
            schedule.run_pending()
            time.sleep(60)  # Poll the scheduler once a minute

# Usage example
if __name__ == "__main__":
    # Configuration
    product_url = "https://example.com/product/123"
    target_price = 100.0
    email_settings = {
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'username': 'your_email@gmail.com',
        'password': 'your_password',
        'from_email': 'your_email@gmail.com',
        'to_email': 'recipient@example.com'
    }

    monitor = PriceMonitor(product_url, target_price, email_settings)
    monitor.run_monitor(check_interval_hours=2)
```
8.2 News Content Aggregator

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
from datetime import datetime
import time

class NewsAggregator:
    def __init__(self):
        self.sources = {
            'source1': {
                'url': 'https://news-source1.com/latest',
                'article_selector': '.article',
                'title_selector': '.title',
                'summary_selector': '.summary',
                'date_selector': '.publish-date',
                'link_selector': 'a[href]'
            },
            'source2': {
                'url': 'https://news-source2.com/news',
                'article_selector': '.news-item',
                'title_selector': 'h2',
                'summary_selector': '.description',
                'date_selector': '.time',
                'link_selector': 'a'
            }
            # Add more news sources here
        }
        self.articles = []

    def scrape_source(self, source_name, source_config):
        try:
            print(f"Scraping {source_name}...")
            response = requests.get(source_config['url'], timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')
            article_elements = soup.select(source_config['article_selector'])

            for article in article_elements:
                try:
                    title_elem = article.select_one(source_config['title_selector'])
                    summary_elem = article.select_one(source_config['summary_selector'])
                    date_elem = article.select_one(source_config['date_selector'])
                    link_elem = article.select_one(source_config['link_selector'])

                    if title_elem and link_elem:
                        article_data = {
                            'source': source_name,
                            'title': title_elem.get_text().strip(),
                            'summary': summary_elem.get_text().strip() if summary_elem else '',
                            'date': date_elem.get_text().strip() if date_elem else '',
                            'link': link_elem['href'] if link_elem and 'href' in link_elem.attrs else '',
                            'scraped_at': datetime.now().isoformat()
                        }

                        # Make sure the link is an absolute URL
                        if article_data['link'] and not article_data['link'].startswith('http'):
                            article_data['link'] = source_config['url'] + article_data['link']

                        self.articles.append(article_data)

                except Exception as e:
                    print(f"Error while processing an article: {e}")
                    continue

        except Exception as e:
            print(f"Error while scraping {source_name}: {e}")

    def scrape_all_sources(self):
        print("Scraping all news sources...")
        self.articles = []

        for source_name, source_config in self.sources.items():
            self.scrape_source(source_name, source_config)
            time.sleep(1)  # Polite delay between sources

        print(f"Done; collected {len(self.articles)} articles")

    def save_to_json(self, filename):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.articles, f, ensure_ascii=False, indent=2)
        print(f"Data saved to {filename}")

    def save_to_csv(self, filename):
        df = pd.DataFrame(self.articles)
        df.to_csv(filename, index=False, encoding='utf-8')
        print(f"Data saved to {filename}")

    def analyze_articles(self):
        df = pd.DataFrame(self.articles)

        if df.empty:
            print("No data to analyze")
            return

        print("\n=== Analysis ===")
        print(f"Total articles: {len(df)}")
        print("\nArticles per source:")
        print(df['source'].value_counts())

        # Date analysis (if date information is available)
        if 'date' in df.columns and not df['date'].empty:
            # Add date parsing and analysis logic here
            pass

        return df

# Usage example
if __name__ == "__main__":
    aggregator = NewsAggregator()
    aggregator.scrape_all_sources()

    if aggregator.articles:
        aggregator.save_to_json('news_articles.json')
        aggregator.save_to_csv('news_articles.csv')

        df = aggregator.analyze_articles()
        print("\nFirst 5 articles:")
        print(df[['source', 'title', 'date']].head())
```
9. Ethical and Legal Considerations <a name="道德与法律考量"></a>

9.1 Key Principles of Lawful Data Collection

| Principle | Description | Practical advice |
|---|---|---|
| Respect robots.txt | Follow the site's crawling policy | Check and obey the target site's robots.txt |
| Throttle request rate | Avoid overloading the site | Add delays and limit concurrent requests (see the sketch after this table) |
| Identify your crawler | Be honest about what your crawler is | Use an appropriate User-Agent |
| Respect copyright | Do not infringe on content rights | Collect only the data you need and credit sources |
| Protect privacy | Do not harvest personal information | Avoid collecting emails, phone numbers, and other sensitive data |
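One simple way to put the "throttle request rate" principle into code is to enforce a minimum gap between consecutive requests. The sketch below is a minimal example; the class name, the one-request-per-interval policy, and the contact address in the User-Agent are illustrative choices:

```python
import time
import requests

class PoliteSession:
    """A thin wrapper around requests.Session that enforces a minimum delay between requests."""

    def __init__(self, min_interval=1.0, user_agent='MyCrawler/1.0 (contact: you@example.com)'):
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': user_agent})
        self.min_interval = min_interval
        self._last_request = 0.0

    def get(self, url, **kwargs):
        # Sleep just long enough to keep at least min_interval seconds between requests
        elapsed = time.time() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        response = self.session.get(url, timeout=kwargs.pop('timeout', 10), **kwargs)
        self._last_request = time.time()
        return response

# Usage: every call waits at least 2 seconds after the previous one
polite = PoliteSession(min_interval=2.0)
# response = polite.get('https://example.com/page1')
```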
9.2 Respecting robots.txt: An Example

```python
import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
import time

def check_robots_permission(url, user_agent='*'):
    """Check whether the given URL may be crawled."""
    try:
        # Parse the URL to get the base URL
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        # Fetch robots.txt
        robots_url = f"{base_url}/robots.txt"
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()

        # Check the permission
        return rp.can_fetch(user_agent, url)

    except Exception as e:
        print(f"Error while checking robots.txt: {e}")
        return False

def respectful_crawler(url, user_agent='MyCrawler/1.0'):
    """A polite crawler that honors robots.txt."""
    if not check_robots_permission(url, user_agent):
        print(f"robots.txt disallows crawling: {url}")
        return None

    try:
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Polite delay
        time.sleep(1)

        return response.text

    except Exception as e:
        print(f"Error while crawling {url}: {e}")
        return None

# Usage example
url = 'https://example.com/some-page'
content = respectful_crawler(url)
if content:
    print("Content fetched successfully")
    # Process the content...
```
10. Summary and Resources <a name="总结与资源"></a>

10.1 Best Practices at a Glance

| Area | Best practice |
|---|---|
| Code structure | Modular design; each function has a single responsibility |
| Error handling | Catch and handle exceptions thoroughly |
| Performance | Use asynchronous requests and sensible caching |
| Maintainability | Clear comments; keep settings in configuration files |
| Playing by the rules | Respect robots.txt and throttle request rates |
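For the "keep settings in configuration files" practice, a minimal sketch looks like this (the file name config.json and its keys are hypothetical):

```python
import json

# config.json (hypothetical) might contain:
# {"base_url": "https://example.com", "request_delay": 2, "user_agent": "MyCrawler/1.0"}

def load_config(path='config.json'):
    """Read scraper settings from a JSON config file instead of hard-coding them."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# config = load_config()
# print(config['base_url'], config['request_delay'])
```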
10.2 Recommended Learning Resources

| Resource type | Recommendations |
|---|---|
| Official docs | Requests, BeautifulSoup, Scrapy, Selenium |
| Online courses | Web scraping courses on Coursera, Udemy, and 慕课网 (iMOOC) |
| Books | 《Python网络数据采集》, 《用Python写网络爬虫》 |
| Communities | Stack Overflow, GitHub, related topics on Zhihu |
| Tools | Postman, Chrome DevTools, Scrapy Cloud |
10.3 Common Problems and Solutions

| Problem | Solution |
|---|---|
| IP gets banned | Use a proxy pool and lower the request rate |
| Dynamic content | Use Selenium or Requests-HTML |
| CAPTCHAs | Use a CAPTCHA-solving service or handle them manually |
| Login authentication | Keep cookies in a session and handle tokens |
| Anti-scraping measures | Rotate User-Agents and mimic human behavior |
With this guide you should now have the full range of Python web scraping skills, from the basics to advanced techniques. Remember that the technology is only a tool; how you use it is what matters. Always follow ethical and legal guidelines and collect data responsibly.
Happy Scraping!