Dask DataFrame
- Improving on pandas is necessary because pandas struggles with data sets that are larger than memory.
- Of course, you can read a huge file in chunks with plain pandas:
import pandas as pd
# Read a large file in chunks
chunk_size = 100000
for chunk in pd.read_csv('very_large_file.csv', chunksize=chunk_size):
    process(chunk)  # process() stands in for your own per-chunk logic
- But for really large data sets, a better approach is to use Dask instead:
# For a dataset that does not fit in memory
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
result = ddf.groupby('column').mean().compute()
- Dask DataFrame is a good tool for processing large tabular data by parallelizing pandas: a Dask DataFrame is a collection of many pandas DataFrames, so it follows the same API and execution model as pandas.
- On a laptop a Dask DataFrame can handle up to about 100 GiB of data, and on a cluster up to about 100 TiB; importantly, it runs in pure Python.
- A Dask DataFrame is partitioned row-wise, grouping rows by index value; the underlying pandas objects may live on one machine or be distributed across several machines.
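- A minimal sketch of what "a collection of pandas DataFrames" means in practice (the toy columns x and y and the partition count below are just for illustration):
import dask.dataframe as dd
import pandas as pd
# Build a small pandas DataFrame and split it row-wise into 4 partitions
pdf = pd.DataFrame({"x": range(100), "y": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)                    # 4 underlying pandas DataFrames
print(type(ddf.partitions[0].compute()))  # each partition is a plain pandas DataFrame
print(ddf.map_partitions(len).compute())  # number of rows held by each partition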
- Loading data: read CSV files directly into a Dask DataFrame (or convert an existing pandas DataFrame with dd.from_pandas, as in the sketch above).
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
Writing a Dask DataFrame to CSV files produces a directory layout like this:
output_csv/
├── multi_partition_0.csv # data for partition 0
├── multi_partition_1.csv # data for partition 1
├── multi_partition_2.csv # data for partition 2
├── multi_partition_3.csv # data for partition 3
├── single_file.csv # all data merged into a single file
├── custom_0.csv.gz # partition 0 in a custom format (compressed)
├── custom_1.csv.gz # partition 1 in a custom format (compressed)
└── ...
import os
import dask.dataframe as dd
import pandas as pd
# Create the example output directory (if it does not exist)
output_dir = "./output_csv"
os.makedirs(output_dir, exist_ok=True)
# Generate example data (10,000 rows)
data = {
    "id": range(1, 10001),
    "name": [f"user_{i}" for i in range(1, 10001)],
    "score": [i * 0.5 for i in range(1, 10001)],
    "date": pd.date_range("2023-01-01", periods=10000)
}
# Convert to a Dask DataFrame (split into 4 partitions)
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=4)
# Method 1: write multiple CSVs (default behavior, one file per partition)
ddf.to_csv(f"{output_dir}/multi_partition_*.csv", index=False)
# Method 2: write a single CSV (suitable for small data)
ddf.to_csv(f"{output_dir}/single_file.csv", single_file=True, index=False)
# Method 3: custom format (with compression)
ddf.to_csv(
    f"{output_dir}/custom_*.csv.gz",
    sep="\t",              # use tab as the separator
    compression="gzip",    # GZIP compression
    header=False           # do not keep the column names
)
print("Write finished! Check the directory:", os.path.abspath(output_dir))
To inspect the content just written to the CSV files:
import dask.dataframe as dd
# Read back the multi-partition CSVs that were just written
output_dir = r"E:\learn\learnpy\output_csv"
new_ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(new_ddf.head())  # show the first few rows
output_dir="E:\learn\learnpy\output_csv"
id name score date
0 1 user_1 0.5 2023-01-01
1 2 user_2 1.0 2023-01-02
2 3 user_3 1.5 2023-01-03
3 4 user_4 2.0 2023-01-04
4 5 user_5 2.5 2023-01-05
- You can operate on the data in the DataFrame as follows; keep in mind that Dask is lazy, so nothing is computed until you call .compute().
import dask.dataframe as dd
# Read back the multi-partition CSVs that were just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(ddf[(ddf.score >= 500) & (ddf.id <= 1050)].compute())
ddf = ddf[ddf.score >= 500]
result = ddf.score.mean()
print(result.compute())
id name score date
999 1000 user_1000 500.0 2025-09-26
1000 1001 user_1001 500.5 2025-09-27
1001 1002 user_1002 501.0 2025-09-28
1002 1003 user_1003 501.5 2025-09-29
1003 1004 user_1004 502.0 2025-09-30
1004 1005 user_1005 502.5 2025-10-01
1005 1006 user_1006 503.0 2025-10-02
1006 1007 user_1007 503.5 2025-10-03
1007 1008 user_1008 504.0 2025-10-04
1008 1009 user_1009 504.5 2025-10-05
1009 1010 user_1010 505.0 2025-10-06
1010 1011 user_1011 505.5 2025-10-07
1011 1012 user_1012 506.0 2025-10-08
1012 1013 user_1013 506.5 2025-10-09
1013 1014 user_1014 507.0 2025-10-10
1014 1015 user_1015 507.5 2025-10-11
1015 1016 user_1016 508.0 2025-10-12
1016 1017 user_1017 508.5 2025-10-13
1017 1018 user_1018 509.0 2025-10-14
1018 1019 user_1019 509.5 2025-10-15
1019 1020 user_1020 510.0 2025-10-16
1020 1021 user_1021 510.5 2025-10-17
1021 1022 user_1022 511.0 2025-10-18
1022 1023 user_1023 511.5 2025-10-19
1023 1024 user_1024 512.0 2025-10-20
1024 1025 user_1025 512.5 2025-10-21
1025 1026 user_1026 513.0 2025-10-22
1026 1027 user_1027 513.5 2025-10-23
1027 1028 user_1028 514.0 2025-10-24
1028 1029 user_1029 514.5 2025-10-25
1029 1030 user_1030 515.0 2025-10-26
1030 1031 user_1031 515.5 2025-10-27
1031 1032 user_1032 516.0 2025-10-28
1032 1033 user_1033 516.5 2025-10-29
1033 1034 user_1034 517.0 2025-10-30
1034 1035 user_1035 517.5 2025-10-31
1035 1036 user_1036 518.0 2025-11-01
1036 1037 user_1037 518.5 2025-11-02
1037 1038 user_1038 519.0 2025-11-03
1038 1039 user_1039 519.5 2025-11-04
1039 1040 user_1040 520.0 2025-11-05
1040 1041 user_1041 520.5 2025-11-06
1041 1042 user_1042 521.0 2025-11-07
1042 1043 user_1043 521.5 2025-11-08
1043 1044 user_1044 522.0 2025-11-09
1044 1045 user_1045 522.5 2025-11-10
1045 1046 user_1046 523.0 2025-11-11
1046 1047 user_1047 523.5 2025-11-12
1047 1048 user_1048 524.0 2025-11-13
1048 1049 user_1049 524.5 2025-11-14
1049 1050 user_1050 525.0 2025-11-15
2750.0
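- Because every step is lazy, you can also build up several results and evaluate them together in one pass; a small sketch (the variable names are just illustrative):
import dask
import dask.dataframe as dd
ddf = dd.read_csv("./output_csv/multi_partition_*.csv")
# Nothing is read or computed yet; these are just task graphs
high = ddf[ddf.score >= 500]
mean_score = high.score.mean()
n_rows = high.score.count()
# Evaluate both results with a single pass over the data
mean_value, row_count = dask.compute(mean_score, n_rows)
print(mean_value, row_count)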
persist()
keeps the result of a computation in (distributed) memory, so repeated operations on the same result are only computed once.
import dask.dataframe as dd
import pandas as pd
# Read back the multi-partition CSVs that were just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
df_filtered = ddf[(ddf.score >= 500) & (ddf.id <= 1050)].persist()  # compute immediately and cache the result
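- Once persisted, later computations on df_filtered reuse the cached partitions instead of re-reading and re-filtering the CSV files; a small sketch of how you might use it:
# These all start from the in-memory result produced by persist()
print(df_filtered.score.mean().compute())
print(df_filtered.score.max().compute())
print(len(df_filtered))  # len() triggers computation and returns a plain int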