Dask DataFrame
- Improving on pandas is necessary because pandas struggles with data sets that are larger than memory.
- Of course, you can read a huge file in chunks with plain pandas:
import pandas as pd
# Read a large file in chunks
chunk_size = 100000
for chunk in pd.read_csv('very_large_file.csv', chunksize=chunk_size):
    process(chunk)  # process() stands in for your own per-chunk logic
- But for really large data sets, a better approach is to use Dask instead:
# For a dataset that does not fit in memory
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
result = ddf.groupby('column').mean().compute()
- Dask DataFrame is a good tool for processing large tabular data by parallelizing pandas: a Dask DataFrame is a collection of many pandas DataFrames, so it follows the same API and execution model as pandas.
- On a laptop a Dask DataFrame can handle up to about 100 GiB of data, and on a cluster up to about 100 TiB; importantly, it runs in pure Python.
- A Dask DataFrame is partitioned row-wise, grouping rows by index value; the underlying pandas objects may live on one machine or be distributed across several machines.
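- A minimal sketch of what "a collection of pandas DataFrames" means in practice (the toy columns x and y and the partition count below are just for illustration):
import dask.dataframe as dd
import pandas as pd
# Build a small pandas DataFrame and split it row-wise into 4 partitions
pdf = pd.DataFrame({"x": range(100), "y": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)                    # 4 underlying pandas DataFrames
print(type(ddf.partitions[0].compute()))  # each partition is a plain pandas DataFrame
print(ddf.map_partitions(len).compute())  # number of rows held by each partition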
- Loading data: read CSV files directly into a Dask DataFrame (or convert an existing pandas DataFrame with dd.from_pandas, as in the sketch above).
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
Writing a Dask DataFrame to CSV files produces a directory layout like this:
output_csv/
├── multi_partition_0.csv # data for partition 0
├── multi_partition_1.csv # data for partition 1
├── multi_partition_2.csv # data for partition 2
├── multi_partition_3.csv # data for partition 3
├── single_file.csv # all data merged into a single file
├── custom_0.csv.gz # partition 0 in a custom format (compressed)
├── custom_1.csv.gz # partition 1 in a custom format (compressed)
└── ...
import os
import dask.dataframe as dd
import pandas as pd
# Create the example output directory (if it does not exist)
output_dir = "./output_csv"
os.makedirs(output_dir, exist_ok=True)
# Generate example data (10,000 rows)
data = {
    "id": range(1, 10001),
    "name": [f"user_{i}" for i in range(1, 10001)],
    "score": [i * 0.5 for i in range(1, 10001)],
    "date": pd.date_range("2023-01-01", periods=10000)
}
# Convert to a Dask DataFrame (split into 4 partitions)
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=4)
# Method 1: write multiple CSVs (default behavior, one file per partition)
ddf.to_csv(f"{output_dir}/multi_partition_*.csv", index=False)
# Method 2: write a single CSV (suitable for small data)
ddf.to_csv(f"{output_dir}/single_file.csv", single_file=True, index=False)
# Method 3: custom format (with compression)
ddf.to_csv(
    f"{output_dir}/custom_*.csv.gz",
    sep="\t",              # use tab as the separator
    compression="gzip",    # GZIP compression
    header=False           # do not keep the column names
)
print("Write finished! Check the directory:", os.path.abspath(output_dir))
To inspect the content just written to the CSV files:
import dask.dataframe as dd
# Read back the multi-partition CSVs that were just written
output_dir = r"E:\learn\learnpy\output_csv"
new_ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(new_ddf.head())  # show the first few rows
output_dir="E:\learn\learnpy\output_csv"
id name score date
0 1 user_1 0.5 2023-01-01
1 2 user_2 1.0 2023-01-02
2 3 user_3 1.5 2023-01-03
3 4 user_4 2.0 2023-01-04
4 5 user_5 2.5 2023-01-05
- You can operate on the data in the DataFrame as follows; keep in mind that Dask is lazy, so nothing is computed until you call .compute().
import dask.dataframe as dd
# Read back the multi-partition CSVs that were just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(ddf[(ddf.score >= 500) & (ddf.id <= 1050)].compute())
ddf = ddf[ddf.score >= 500]
result = ddf.score.mean()
print(result.compute())
id name score date
999 1000 user_1000 500.0 2025-09-26
1000 1001 user_1001 500.5 2025-09-27
1001 1002 user_1002 501.0 2025-09-28
1002 1003 user_1003 501.5 2025-09-29
1003 1004 user_1004 502.0 2025-09-30
1004 1005 user_1005 502.5 2025-10-01
1005 1006 user_1006 503.0 2025-10-02
1006 1007 user_1007 503.5 2025-10-03
1007 1008 user_1008 504.0 2025-10-04
1008 1009 user_1009 504.5 2025-10-05
1009 1010 user_1010 505.0 2025-10-06
1010 1011 user_1011 505.5 2025-10-07
1011 1012 user_1012 506.0 2025-10-08
1012 1013 user_1013 506.5 2025-10-09
1013 1014 user_1014 507.0 2025-10-10
1014 1015 user_1015 507.5 2025-10-11
1015 1016 user_1016 508.0 2025-10-12
1016 1017 user_1017 508.5 2025-10-13
1017 1018 user_1018 509.0 2025-10-14
1018 1019 user_1019 509.5 2025-10-15
1019 1020 user_1020 510.0 2025-10-16
1020 1021 user_1021 510.5 2025-10-17
1021 1022 user_1022 511.0 2025-10-18
1022 1023 user_1023 511.5 2025-10-19
1023 1024 user_1024 512.0 2025-10-20
1024 1025 user_1025 512.5 2025-10-21
1025 1026 user_1026 513.0 2025-10-22
1026 1027 user_1027 513.5 2025-10-23
1027 1028 user_1028 514.0 2025-10-24
1028 1029 user_1029 514.5 2025-10-25
1029 1030 user_1030 515.0 2025-10-26
1030 1031 user_1031 515.5 2025-10-27
1031 1032 user_1032 516.0 2025-10-28
1032 1033 user_1033 516.5 2025-10-29
1033 1034 user_1034 517.0 2025-10-30
1034 1035 user_1035 517.5 2025-10-31
1035 1036 user_1036 518.0 2025-11-01
1036 1037 user_1037 518.5 2025-11-02
1037 1038 user_1038 519.0 2025-11-03
1038 1039 user_1039 519.5 2025-11-04
1039 1040 user_1040 520.0 2025-11-05
1040 1041 user_1041 520.5 2025-11-06
1041 1042 user_1042 521.0 2025-11-07
1042 1043 user_1043 521.5 2025-11-08
1043 1044 user_1044 522.0 2025-11-09
1044 1045 user_1045 522.5 2025-11-10
1045 1046 user_1046 523.0 2025-11-11
1046 1047 user_1047 523.5 2025-11-12
1047 1048 user_1048 524.0 2025-11-13
1048 1049 user_1049 524.5 2025-11-14
1049 1050 user_1050 525.0 2025-11-15
2750.0
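- Because every step is lazy, you can also build up several results and evaluate them together in one pass; a small sketch (the variable names are just illustrative):
import dask
import dask.dataframe as dd
ddf = dd.read_csv("./output_csv/multi_partition_*.csv")
# Nothing is read or computed yet; these are just task graphs
high = ddf[ddf.score >= 500]
mean_score = high.score.mean()
n_rows = high.score.count()
# Evaluate both results with a single pass over the data
mean_value, row_count = dask.compute(mean_score, n_rows)
print(mean_value, row_count)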
persist()
keeps the result of a computation in (distributed) memory, so repeated operations on the same result are only computed once.
import dask.dataframe as dd
import pandas as pd
# Read back the multi-partition CSVs that were just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
df_filtered = ddf[(ddf.score >= 500) & (ddf.id <= 1050)].persist()  # compute immediately and cache the result
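- Once persisted, later computations on df_filtered reuse the cached partitions instead of re-reading and re-filtering the CSV files; a small sketch of how you might use it:
# These all start from the in-memory result produced by persist()
print(df_filtered.score.mean().compute())
print(df_filtered.score.max().compute())
print(len(df_filtered))  # len() triggers computation and returns a plain int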