Dask study notes [3]

Dask DataFrame

  1. Improving on pandas is necessary because it struggles with data sets that are large relative to memory.
  • Of course, you can read a huge file in chunks with pandas itself:
# Read a large file in chunks
import pandas as pd

chunk_size = 100000
for chunk in pd.read_csv('very_large_file.csv', chunksize=chunk_size):
    process(chunk)  # process() stands in for your per-chunk logic
  • But a better approach is to switch to Dask when the data set outgrows memory:
# For data sets that exceed memory
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
result = ddf.groupby('column').mean().compute()
  2. Dask DataFrame parallelizes pandas to process large tabular data: a Dask DataFrame is a collection of many pandas DataFrames, which means it follows the same API and execution model as pandas (see the partition sketch after this list).
  3. On a laptop a Dask DataFrame can handle up to about 100 GiB of data, and on a cluster up to about 100 TiB; most importantly, it runs in pure Python (a cluster sketch also follows this list).
  4. A Dask DataFrame is partitioned row-wise, grouping rows by index value; the underlying pandas objects may sit on one machine or be distributed across many.
  5. Loading data works much like pandas:
import dask.dataframe as dd
ddf = dd.read_csv('extremely_large_*.csv')
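
To make point 2 concrete, here is a minimal sketch (with toy data invented for illustration) showing that every partition of a Dask DataFrame is itself a pandas DataFrame:

import pandas as pd
import dask.dataframe as dd

# A small pandas DataFrame, split into 2 partitions
pdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf.npartitions)               # 2
part0 = ddf.partitions[0].compute()  # materialize the first partition
print(type(part0))                   # <class 'pandas.core.frame.DataFrame'>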
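
For point 3, the same code runs on a cluster once you create a distributed Client; a minimal sketch, where the scheduler address is an assumption to be replaced with your own:

import dask.dataframe as dd
from dask.distributed import Client

# With no arguments, Client() spawns a local cluster; for a real cluster,
# pass the scheduler address, e.g. "tcp://scheduler:8786" (hypothetical)
client = Client()
ddf = dd.read_csv('extremely_large_*.csv')
print(ddf.groupby('column').mean().compute())  # executed on the workers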

Writing a Dask DataFrame to CSV files yields a directory layout like this:

output_csv/
├── multi_partition_0.csv   # data from partition 0
├── multi_partition_1.csv   # data from partition 1
├── multi_partition_2.csv   # data from partition 2
├── multi_partition_3.csv   # data from partition 3
├── single_file.csv         # all data merged into a single file
├── custom_0.csv.gz         # custom-format partition 0 (compressed)
├── custom_1.csv.gz         # custom-format partition 1 (compressed)
└── ...
import os
import dask.dataframe as dd
import pandas as pd

# Create the output directory if it does not exist
output_dir = "./output_csv"
os.makedirs(output_dir, exist_ok=True)

# Generate sample data (10000 rows)
data = {
    "id": range(1, 10001),
    "name": [f"user_{i}" for i in range(1, 10001)],
    "score": [i * 0.5 for i in range(1, 10001)],
    "date": pd.date_range("2023-01-01", periods=10000)
}

# Convert to a Dask DataFrame with 4 partitions
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=4)

# Method 1: write multiple CSVs (the default: one file per partition)
ddf.to_csv(f"{output_dir}/multi_partition_*.csv", index=False)

# Method 2: write a single CSV (fine for small data)
ddf.to_csv(f"{output_dir}/single_file.csv", single_file=True, index=False)

# Method 3: custom format (with compression)
ddf.to_csv(
    f"{output_dir}/custom_*.csv.gz",
    sep="\t",           # tab-separated
    compression="gzip", # GZIP compression
    header=False        # drop the column names
)

print("Write complete! Check the directory:", os.path.abspath(output_dir))

To inspect the content just written to the CSV files:

import dask.dataframe as dd

# Read back the multi-partition CSVs just written
output_dir = r"E:\learn\learnpy\output_csv"  # raw string, so backslashes are not escapes
new_ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(new_ddf.head())  # show the first few rows
  output_dir="E:\learn\learnpy\output_csv"
   id    name  score        date
0   1  user_1    0.5  2023-01-01
1   2  user_2    1.0  2023-01-02
2   3  user_3    1.5  2023-01-03
3   4  user_4    2.0  2023-01-04
4   5  user_5    2.5  2023-01-05
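
Note that dd.read_csv infers dtypes from a sample, so the date column comes back as plain strings. Extra keyword arguments are forwarded to pandas.read_csv, so you can request proper parsing; a small sketch:

# Parse the date column while reading instead of keeping it as strings
new_ddf = dd.read_csv(
    f"{output_dir}/multi_partition_*.csv",
    parse_dates=["date"],  # forwarded to pandas.read_csv
)
print(new_ddf.dtypes)
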
  6. You can filter and aggregate the DataFrame as follows; keep in mind that Dask computes lazily, so nothing runs until you call compute() (see the note after the output below).
import dask.dataframe as dd

# Read back the multi-partition CSVs just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
print(ddf[(ddf.score >= 500) & (ddf.id <= 1050)].compute())

ddf = ddf[ddf.score >= 500]
result = ddf.score.mean()
print(result.compute())
        id       name  score        date
999   1000  user_1000  500.0  2025-09-26
1000  1001  user_1001  500.5  2025-09-27
1001  1002  user_1002  501.0  2025-09-28
1002  1003  user_1003  501.5  2025-09-29
1003  1004  user_1004  502.0  2025-09-30
1004  1005  user_1005  502.5  2025-10-01
1005  1006  user_1006  503.0  2025-10-02
1006  1007  user_1007  503.5  2025-10-03
1007  1008  user_1008  504.0  2025-10-04
1008  1009  user_1009  504.5  2025-10-05
1009  1010  user_1010  505.0  2025-10-06
1010  1011  user_1011  505.5  2025-10-07
1011  1012  user_1012  506.0  2025-10-08
1012  1013  user_1013  506.5  2025-10-09
1013  1014  user_1014  507.0  2025-10-10
1014  1015  user_1015  507.5  2025-10-11
1015  1016  user_1016  508.0  2025-10-12
1016  1017  user_1017  508.5  2025-10-13
1017  1018  user_1018  509.0  2025-10-14
1018  1019  user_1019  509.5  2025-10-15
1019  1020  user_1020  510.0  2025-10-16
1020  1021  user_1021  510.5  2025-10-17
1021  1022  user_1022  511.0  2025-10-18
1022  1023  user_1023  511.5  2025-10-19
1023  1024  user_1024  512.0  2025-10-20
1024  1025  user_1025  512.5  2025-10-21
1025  1026  user_1026  513.0  2025-10-22
1026  1027  user_1027  513.5  2025-10-23
1027  1028  user_1028  514.0  2025-10-24
1028  1029  user_1029  514.5  2025-10-25
1029  1030  user_1030  515.0  2025-10-26
1030  1031  user_1031  515.5  2025-10-27
1031  1032  user_1032  516.0  2025-10-28
1032  1033  user_1033  516.5  2025-10-29
1033  1034  user_1034  517.0  2025-10-30
1034  1035  user_1035  517.5  2025-10-31
1035  1036  user_1036  518.0  2025-11-01
1036  1037  user_1037  518.5  2025-11-02
1037  1038  user_1038  519.0  2025-11-03
1038  1039  user_1039  519.5  2025-11-04
1039  1040  user_1040  520.0  2025-11-05
1040  1041  user_1041  520.5  2025-11-06
1041  1042  user_1042  521.0  2025-11-07
1042  1043  user_1043  521.5  2025-11-08
1043  1044  user_1044  522.0  2025-11-09
1044  1045  user_1045  522.5  2025-11-10
1045  1046  user_1046  523.0  2025-11-11
1046  1047  user_1047  523.5  2025-11-12
1047  1048  user_1048  524.0  2025-11-13
1048  1049  user_1049  524.5  2025-11-14
1049  1050  user_1050  525.0  2025-11-15
2750.0
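
As noted above, every expression only builds a task graph until compute() runs it. When you need several results from the same intermediate data, dask.compute can evaluate them in a single pass; a minimal sketch, reusing the already-filtered ddf from the block above:

import dask

mean_score = ddf.score.mean()  # lazy: just builds a task graph
n_rows = ddf.score.count()     # also lazy

# One call evaluates both results over the shared graph
mean_val, n_val = dask.compute(mean_score, n_rows)
print(mean_val, n_val)
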
  7. persist() computes the result immediately and keeps it in (distributed) memory, so later operations that reuse it do not recompute it; the same work is done only once (a reuse sketch follows the code below).
import dask.dataframe as dd

# Read back the multi-partition CSVs just written
output_dir = "./output_csv"
ddf = dd.read_csv(f"{output_dir}/multi_partition_*.csv")
df_filtered = ddf[(ddf.score >= 500) & (ddf.id <= 1050)].persist()  # compute now and cache the result
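
To see the benefit, run several aggregations against the persisted frame; the CSV read and the filter execute once, and each call below reuses the cached partitions (a minimal sketch):

# Each call reuses the in-memory result of the filter above
print(df_filtered.score.mean().compute())
print(df_filtered.score.max().compute())
print(len(df_filtered))  # row count; len triggers computation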


