Reading Excel Files with joblib Multiprocessing

Latest recommended article published on 2025-06-24 16:02:19


License: CC 4.0 BY-SA


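This article shows how to read Excel files in parallel across multiple processes with Python's joblib library. The example defines a read_excel function and uses Parallel and delayed to process several files; with n_jobs=-1, joblib uses all available cores.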


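To read Excel files with joblib multiprocessing, first install the joblib library with pip (the example below also needs pandas, and pandas needs openpyxl to read .xlsx files):

pip install joblib

You can then use joblib's Parallel class to read the Excel files in parallel.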

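Here is a simple example:

from joblib import Parallel, delayed
import pandas as pd

def read_excel(file_path):
    # Load one workbook (first sheet by default) into a DataFrame.
    return pd.read_excel(file_path)

# Placeholder file names; replace them with your own paths.
file_paths = ["file1.xlsx", "file2.xlsx", "file3.xlsx"]
# One task per file; n_jobs=-1 uses all available cores.
dfs = Parallel(n_jobs=-1)(delayed(read_excel)(file_path) for file_path in file_paths)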

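The code above defines a read_excel() function that loads one workbook into a DataFrame, then uses joblib's Parallel and delayed to call it on several Excel files at once. The result, dfs, is a list of DataFrames in the same order as file_paths.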
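A common next step is to collect the workbooks from a folder and stack the results into a single DataFrame. The sketch below is one possible way to do that, not something from the original example: the helper names read_many and load_folder, the "*.xlsx" pattern, and the assumption that all files share the same columns are hypothetical.

```python
from pathlib import Path

import pandas as pd
from joblib import Parallel, delayed


def read_many(paths, reader=pd.read_excel, n_jobs=-1):
    """Apply `reader` to every path in parallel; results keep the input order."""
    return Parallel(n_jobs=n_jobs)(delayed(reader)(p) for p in paths)


def load_folder(folder, pattern="*.xlsx", n_jobs=-1):
    """Read every file in `folder` matching `pattern` into one DataFrame.

    Assumes the workbooks share the same columns (hypothetical layout).
    """
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    frames = read_many(paths, n_jobs=n_jobs)
    # ignore_index=True renumbers rows instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)
```

For example, load_folder("data") would read every .xlsx file in a directory named data (a placeholder name). Each worker process has to send its DataFrame back to the parent, so the speed-up tends to be larger for many medium-sized files than for a few very small ones.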

Note: n_jobs=-1 tells joblib to use all available CPU cores.
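To see how a given n_jobs value translates into workers on your machine, joblib exposes cpu_count and effective_n_jobs. The following is a small sketch; the helper name resolve_workers is hypothetical, and the numbers it reports depend on the machine it runs on.

```python
from joblib import cpu_count, effective_n_jobs


def resolve_workers(values=(1, 2, -1, -2)):
    """Map each n_jobs setting to the worker count joblib would actually use."""
    return {n_jobs: effective_n_jobs(n_jobs) for n_jobs in values}


if __name__ == "__main__":
    print("cores seen by joblib:", cpu_count())
    for n_jobs, workers in resolve_workers().items():
        print(f"n_jobs={n_jobs:>2} -> {workers} worker(s)")
```

Negative values count back from the number of cores: -1 means all of them and -2 means all but one, which can be a reasonable choice if the machine should stay responsive while the files load.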


Author: 一曲歌长安


