

# Pandas ML Utils
Pandas ML Utils is intended to help you through your journey of applying statistical or machine learning models to data, without ever needing to leave the world of pandas.
1. install
1. analyze your features
1. find a model
1. save and reuse your model
Or [read the docs](https://pandas-ml-utils.readthedocs.io/en/latest/).
## Install
```bash
pip install pandas-ml-utils
```
## Analyze your Features
The feature_selection functionality helps you analyze your features, filter out highly correlated ones, and focus on the most important features. This function also applies an autoregression and embeds an ACF plot.
```python
import pandas_ml_utils as pmu
import pandas as pd
df = pd.read_csv('burritos.csv')[["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling", "Uniformity", "Salsa", "Synergy", "Wrap", "overall"]]
df.feature_selection(label_column="overall")
```

Tortilla overall Synergy Fillings Temp Salsa \
Tortilla 1.0 0.403981 0.367575 0.345613 0.290702 0.267212
Meat Uniformity Meat:filling Wrap
Tortilla 0.260194 0.208666 0.207518 0.160831
label is continuous: True

Feature ranking:
['Synergy', 'Meat', 'Fillings', 'Meat:filling', 'Wrap', 'Tortilla', 'Uniformity', 'Salsa', 'Temp']
TOP 5 features
Synergy Meat Fillings Meat:filling Wrap
Synergy 1.0 0.601545 0.663328 0.428505 0.08685
filtered features with correlation < 0.5
Synergy Meat:filling Wrap
Tortilla 0.367575 0.207518 0.160831


Synergy 1.000000
Synergy_0 1.000000
Synergy_1 0.147495
Synergy_56 0.128449
Synergy_78 0.119272
Synergy_55 0.111832
Synergy_79 0.086466
Synergy_47 0.085117
Synergy_53 0.084786
Synergy_37 0.084312
Name: Synergy, dtype: float64

Meat:filling 1.000000
Meat:filling_0 1.000000
Meat:filling_15 0.185946
Meat:filling_35 0.175837
Meat:filling_1 0.122546
Meat:filling_87 0.118597
Meat:filling_33 0.112875
Meat:filling_73 0.103090
Meat:filling_72 0.103054
Meat:filling_71 0.089437
Name: Meat:filling, dtype: float64

Wrap 1.000000
Wrap_0 1.000000
Wrap_63 0.210823
Wrap_88 0.189735
Wrap_1 0.169132
Wrap_87 0.166502
Wrap_66 0.146689
Wrap_89 0.141822
Wrap_74 0.120047
Wrap_11 0.115095
Name: Wrap, dtype: float64
best lags are
[(1, '-1.00'), (2, '-0.15'), (88, '-0.10'), (64, '-0.07'), (19, '-0.07'), (89, '-0.06'), (36, '-0.05'), (43, '-0.05'), (16, '-0.05'), (68, '-0.04'), (90, '-0.04'), (87, '-0.04'), (3, '-0.03'), (20, '-0.03'), (59, '-0.03'), (75, '-0.03'), (91, '-0.03'), (57, '-0.03'), (46, '-0.02'), (48, '-0.02'), (54, '-0.02'), (73, '-0.02'), (25, '-0.02'), (79, '-0.02'), (76, '-0.02'), (37, '-0.02'), (71, '-0.02'), (15, '-0.02'), (49, '-0.02'), (12, '-0.02'), (65, '-0.02'), (40, '-0.02'), (24, '-0.02'), (78, '-0.02'), (53, '-0.02'), (8, '-0.02'), (44, '-0.01'), (45, '0.01'), (56, '0.01'), (26, '0.01'), (82, '0.01'), (77, '0.02'), (22, '0.02'), (83, '0.02'), (11, '0.02'), (66, '0.02'), (31, '0.02'), (80, '0.02'), (92, '0.02'), (39, '0.03'), (27, '0.03'), (70, '0.04'), (41, '0.04'), (51, '0.04'), (4, '0.04'), (7, '0.05'), (13, '0.05'), (97, '0.06'), (60, '0.06'), (42, '0.06'), (96, '0.06'), (95, '0.06'), (30, '0.07'), (81, '0.07'), (52, '0.07'), (9, '0.07'), (61, '0.07'), (84, '0.07'), (29, '0.08'), (94, '0.08'), (28, '0.11')]
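
To make the idea behind the output above more concrete, here is a minimal sketch in plain pandas of correlation-based filtering and lag ranking. It is not the library's implementation: the helper names `filter_correlated` and `top_lags`, the greedy strategy, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def filter_correlated(df: pd.DataFrame, label: str, threshold: float = 0.5) -> list:
    """Rank features by |correlation| with the label, then greedily keep only
    those whose |correlation| with every already kept feature is below the
    threshold. (Hypothetical helper, not part of pandas-ml-utils.)"""
    corr = df.corr().abs()
    ranked = corr[label].drop(label).sort_values(ascending=False).index
    kept = []
    for feature in ranked:
        if all(corr.loc[feature, other] < threshold for other in kept):
            kept.append(feature)
    return kept


def top_lags(series: pd.Series, max_lag: int = 10, top: int = 3) -> list:
    """Return the lags in 1..max_lag with the largest absolute autocorrelation.
    (Hypothetical helper, not part of pandas-ml-utils.)"""
    acf = {lag: series.autocorr(lag) for lag in range(1, max_lag + 1)}
    return sorted(acf, key=lambda lag: abs(acf[lag]), reverse=True)[:top]


# Synthetic data standing in for the burrito ratings
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent of "a"
})
demo["label"] = demo["a"] + 0.5 * demo["c"]

selected = filter_correlated(demo, label="label")
lags = top_lags(demo["a"])
print(selected)
print(lags)
```

Because "a" and "b" are almost identical, only one of them survives the filter, while the independent column "c" is kept.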
## Fit a Model
Once you know your features, you can start trying out different models, e.g. a very basic
Logistic Regression. It is also possible to search through a set of hyperparameters.
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('burritos.csv')
df["with_fries"] = df["Fries"].apply(lambda x: str(x).lower() == "x")
df["price"] = df["Cost"] * -1
df = df[["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling", "Uniformity", "Salsa", "Synergy", "Wrap", "overall", "with_fries", "price"]].dropna()
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                                      pmu.FeaturesAndLabels(["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling",
                                                             "Uniformity", "Salsa", "Synergy", "Wrap", "overall"],
                                                            ["with_fries"],
                                                            targets=("price", "price"))))
fit
```
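
For comparison, the same two steps (fitting a classifier on DataFrame columns and searching a few hyperparameters) might look roughly like this in plain scikit-learn. This is only a sketch: the synthetic data, the column name `label`, and the grid over `C` are assumptions, and it does not show how pandas-ml-utils performs its own search.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the burrito data (values are made up)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Tortilla": rng.uniform(1, 5, size=300),
    "Synergy": rng.uniform(1, 5, size=300),
})
# Hypothetical binary label: True when the noisy sum of both ratings exceeds 6
noise = rng.normal(scale=0.5, size=300)
demo["label"] = (demo["Tortilla"] + demo["Synergy"] + noise) > 6

features = ["Tortilla", "Synergy"]
x_train, x_test, y_train, y_test = train_test_split(
    demo[features], demo["label"], test_size=0.4, random_state=42)

# Search a small grid over the regularization strength C with 3-fold CV
search = GridSearchCV(
    LogisticRegression(solver="lbfgs"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(x_train, y_train)

# Accuracy of the best model on the held-out 40 %
accuracy = search.score(x_test, y_test)
print(search.best_params_, round(accuracy, 3))
```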

## Save and use your model
Once you are happy with your model, you can save it and apply it to any DataFrame that
provides the columns your features require.
```python
fit.save_model("/tmp/burrito.model")
```
```python
df = pd.read_csv('burritos.csv')
df["price"] = df["Cost"] * -1
df = df[["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling", "Uniformity", "Salsa", "Synergy", "Wrap", "overall", "price"]].dropna()
df.classify(pmu.Model.load("/tmp/burrito.model")).tail()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="3" halign="left">price</th>
</tr>
<tr>
<th></th>
<th colspan="2" halign="left">prediction</th>
<th>target</th>
</tr>
<tr>
<th></th>
<th>value</th>
<th>value_proba</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<th>380</th>
<td>False</td>
<td>0.251311</td>
<td>-6.85</td>
</tr>
<tr>
<th>381</th>
<td>False</td>
<td>0.328659</td>
<td>-6.85</td>
</tr>
<tr>
<th>382</th>
<td>False</td>
<td>0.064751</td>
<td>-11.50</td>
</tr>
<tr>
<th>383</th>
<td>False</td>
<td>0.428745</td>
<td>-7.89</td>
</tr>
<tr>
<th>384</th>
<td>False</td>
<td>0.265546</td>
<td>-7.89</td>
</tr>
</tbody>
</table>
</div>
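
The save-and-reload round trip above can be sketched with the standard `pickle` module and a scikit-learn model. This is an illustration under assumptions, not the library's storage format: bundling the feature names with the model in a dict and the in-memory buffer are choices made for this example only.

```python
import io
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic, linearly separable data
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
train["label"] = train["x1"] + train["x2"] > 0
model = LogisticRegression(solver="lbfgs").fit(train[["x1", "x2"]], train["label"])

# Store the model together with the feature names it expects.
# An in-memory buffer is used here; a real script would open a file instead.
buffer = io.BytesIO()
pickle.dump({"model": model, "features": ["x1", "x2"]}, buffer)

# Load it back, as a later session would
buffer.seek(0)
restored = pickle.load(buffer)

# Any DataFrame works as long as it provides the required feature columns
new_data = pd.DataFrame({"x1": [2.0, -2.0], "x2": [1.0, -1.0], "unused": [0, 0]})
missing = [c for c in restored["features"] if c not in new_data.columns]
if missing:
    raise ValueError(f"missing feature columns: {missing}")

predictions = restored["model"].predict(new_data[restored["features"]])
print(predictions)
```

Keeping the feature list next to the model means the loading side does not need to know which columns were used during training. Only unpickle files from sources you trust.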
## TODO
* allow multiple classes for classification
* replace hard coded summary objects by a summary provider function
* add more tests
* add Proximity https://stats.stackexchange.com/questions/270201/pooling-levels-of-categorical-variables-for-regression-trees/275867#275867
## Wanna help?
* currently I only need binary classification
* maybe you want to add a feature for multiple classes
* for non classification problems you might want to augment the `Summary`
* write some tests
* add more and different charts for a better understanding/interpretation of the models
* add whatever you need for yourself and share it with us
## Change Log
### 0.0.12
* added sphinx documentation
* added multi model as regular model which has quite a big impact
* features and labels signature changed
* multiple targets now have the consequence that a lot of things are returning a dict
* everything now uses DataFrames instead of arrays after the plain model invocation
* added some tests
* fixed some bugs along the way
### 0.0.11
* Added hyperparameter tuning
```python
from hyperopt import hp
from sklearn.neural_network import MLPClassifier
import pandas_ml_utils as pdu

fit = df.fit_classifier(
    pdu.SkitModel(MLPClassifier(activation='tanh', hidden_layer_sizes=(60, 50), random_state=42),
                  pdu.FeaturesAndLabels(features=['vix_Close'], labels=['label'],
                                        targets=("vix_Open", "spy_Volume"))),
    test_size=0.4,
    # ... the remaining arguments are cut off in the source
)
```