
# pandas-ml-utils
**A note of caution**: this is a one-man hobby project in pre-alpha state, mainly
serving my own needs. Be my guest and use it or extend it.
I was really sick of converting data frames to numpy arrays back and forth just to try out a
simple logistic regression. So I started a pandas ML utilities library where
everything is reachable from the data frame itself. Check out the following examples
to see what I mean by that.
## Fitting
### Ordinary Binary Classification
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df["label"] = bc.target

fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs', max_iter=300),
                                      pmu.FeaturesAndLabels(features=['mean radius', 'mean texture',
                                                                      'mean perimeter', 'mean area',
                                                                      'worst concave points',
                                                                      'worst fractal dimension'],
                                                            labels=['label'])),
                        test_size=0.4)
```
As a result you get a Fit object which holds the fitted model and two ClassificationSummary
objects, one for the training data and one for the test data. If the classification was
executed in a notebook, you get a nice table:

### Binary Classification with Loss
As you can see in the example above, there are two confusion matrices: the regular, well-known
one and a "loss" matrix. The intent of the loss matrix is to tell you whether a misclassification
has a cost, i.e. a loss in dollars.
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression

df = pd.fetch_yahoo(spy='SPY')
df["label"] = df["spy_Close"] > df["spy_Open"]
df["loss"] = (df["spy_Open"] / df["spy_Close"] - 1) * 100

fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                                      pmu.FeaturesAndLabels(features=['spy_Open', 'spy_Low'],
                                                            labels=['label'],
                                                            loss_column='loss')),
                        test_size=0.4)
```

Now you can see the loss (in percent of dollars) of your misclassifications. The classification
probabilities are plotted at the very top of the plot.
### Autoregressive Models and RNN Shape
It is also possible to use the FeaturesAndLabels object to generate autoregressive
features. By default, lagging features results in an RNN-shaped 3D array (in the format
Keras likes). However, we can also use SkitModels; the features will then be implicitly
transformed back into a 2D array (by using the `reshape_rnn_as_ar` function).
```python
import pandas_ml_utils as pmu

pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      feature_lags=range(0, 10))
```
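The reshaping mentioned above can be illustrated with plain numpy. This is a sketch of what a function like `reshape_rnn_as_ar` presumably does (the shapes here are made up): the `(samples, lags, features)` axes of the RNN-shaped array get flattened into `(samples, lags * features)` so a plain sklearn model can consume them.

```python
import numpy as np

# hypothetical RNN-shaped data: 100 samples, 10 lags, 1 feature,
# as would be produced by feature_lags=range(0, 10) for a single feature
rnn_shaped = np.arange(100 * 10 * 1, dtype=float).reshape(100, 10, 1)

# flatten the lag and feature axes into one so classic 2D models can use it
ar_shaped = rnn_shaped.reshape(rnn_shaped.shape[0], -1)

print(rnn_shaped.shape)  # (100, 10, 1)
print(ar_shaped.shape)   # (100, 10)
```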
One may like to use very long lags, e.g. to catch seasonal effects. Since very long lags
are a bit fuzzy, I usually like to smooth them a bit by using simple moving averages.
```python
import pandas_ml_utils as pmu

pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      target_columns=['strike'],
                      loss_column='put_loss',
                      feature_lags=[0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233],
                      lag_smoothing={
                          6: lambda df: df.SMA(3, price=df.columns[0]),
                          35: lambda df: df.SMA(5, price=df.columns[0])
                      })
```
Every lag from 6 onwards will be smoothed by a 3 period moving average, and every lag from
35 onwards by a 5 period moving average.
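The `df.SMA(3, ...)` calls above rely on talib's pandas bindings; the same smoothing can be sketched with a plain pandas rolling mean (the price series here is made up):

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# a 3 period simple moving average, as used for lags from 6 onwards;
# the first two values are NaN because the window is not yet full
sma3 = prices.rolling(window=3).mean()
print(sma3.tolist())  # [nan, nan, 11.0, 12.0, 13.0, 14.0]
```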
## Cross Validation
It is possible to apply a cross validation algorithm to the training data (after the train/test
split). In case you only want cross validation, pass `test_size=0`.
Note that the current implementation just fits the model on all folds one after the
other, without any averaging of the validation loss. However, the folds can be looped many
times, which essentially means we invented something like "fold epochs". Your fitter
epochs can therefore be divided by the number of fold epochs.
```python
from sklearn.model_selection import KFold

cv = KFold(n_splits=10)

fit = df.fit_classifier(...,
                        SomeModel(epochs=100 // 10),
                        test_size=0.1,  # keep 10% completely unseen
                        cross_validation=(10, cv.split),
                        ...)
```
## Back-Testing a Model
todo ... `df.backtest_classifier(...)`
## Save, load, reuse a Model
To save a model you simply call the save method on the model inside of the fit.
```python
fit.model.save('/tmp/foo.model')
```
Loading is as simple as calling load on the Model object. You can immediately apply
the model to a dataframe to get back the features along with the classification
(which is just another data frame).
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df.classify(pmu.Model.load('/tmp/foo.model')).tail()
```
NOTE: If you have a target level for your binary classifier, like all houses cheaper than
50k, then you can define this target level on the FeaturesAndLabels object like so:
`FeaturesAndLabels(target_columns=['House Price'])`. These target columns are simply passed
through to the classified dataframe.
### Fitting other models than classifiers
For non-classification tasks, use the regressor functions the same way as the classifier
functions.
* `df.fit_regressor(...)`
* `df.backtest_regressor(...)`
* `df.regress(...)`
### Other utility objects
#### LazyDataFrame
Very often I need to do a lot of feature engineering, and very often I do not want to
treat averages or other engineered columns as part of the data(frame). For this use
case I have added a LazyDataFrame object wrapping a regular DataFrame, where
some columns will always be calculated on the fly.
Here is an example:
```python
import pandas_ml_utils as pmu
import pandas as pd
import talib

df = pd.fetch_yahoo(spy='SPY')
ldf = pmu.LazyDataFrame(df,
                        rolling_stddev=lambda x: talib.STDDEV(x['spy_Close'], timeperiod=30) / 100)
ldf["rolling_stddev"].tail()  # will always be calculated on the fly
```
#### HashableDataFrame
The hashable dataframe is nothing that should be used directly. It is just a
hack to allow caching of feature matrices. With heavy usage of LazyDataFrame and heavy
lagging of features for AR models, the training data preparation might take a long time.
To shorten this time, e.g. for hyperparameter tuning, a cache is very helpful (but keep
in mind this is still kind of a hack).
To set the cache size (default is 1), set the following environment variable before import:
`os.environ["CACHE_FEATUES_AND_LABELS"] = "2"`. To use the cache, simply pass the
argument to the fit_classifier method like so: `df.fit_classifier(..., cache_feature_matrix=True)`
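Such a wrapper can be sketched with `pd.util.hash_pandas_object` plus `functools.lru_cache`. The class and function names below are illustrative, not the library's API; the point is that a hashable wrapper lets identical frames hit the cache instead of re-running expensive feature preparation.

```python
from functools import lru_cache

import pandas as pd

class HashableDataFrameSketch:
    """Sketch: wrap a DataFrame so it can serve as a cache key."""

    def __init__(self, df):
        self.df = df
        # hash the full content once; frames with identical data hash identically
        self._hash = int(pd.util.hash_pandas_object(df).sum())

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return isinstance(other, HashableDataFrameSketch) and self.df.equals(other.df)

@lru_cache(maxsize=2)
def make_feature_matrix(hdf):
    # expensive feature engineering would happen here
    return hdf.df.values * 2

df = pd.DataFrame({"a": [1, 2, 3]})
m1 = make_feature_matrix(HashableDataFrameSketch(df))
m2 = make_feature_matrix(HashableDataFrameSketch(df.copy()))

print(make_feature_matrix.cache_info().hits)  # 1 - second call was served from the cache
```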
#### MultiModel
TODO describe multi models ...
## TODO
* replace hard coded summary objects by a summary provider function
* multi model is just another implementation of model
* add keras model
* add more tests
## Wanna help?
* currently I only need binary classification; maybe you want to add support for multiple classes
* for non-classification problems you might want to augment the `Summary`
* write some tests