
# pandas-ml-utils
**A note of caution**: this is a one-man hobby project in pre-alpha state, mainly
serving my own needs. Be my guest and use it or extend it.
I was really sick of converting data frames to numpy arrays back and forth just to try out a
simple logistic regression. So I started a pandas ML utilities library where
everything should be reachable from the data frame itself. Check out the following examples
to see what I mean by that.
## Fitting
### Ordinary Binary Classification
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)
df["label"] = bc.target
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs', max_iter=300),
                                      pmu.FeaturesAndLabels(features=['mean radius', 'mean texture',
                                                                      'mean perimeter', 'mean area',
                                                                      'worst concave points',
                                                                      'worst fractal dimension'],
                                                            labels=['label'])),
                        test_size=0.4)
```
As a result you get a `Fit` object which holds the fitted model and two `ClassificationSummary`
objects, one for the training data and one for the test data. If the classification was
executed in a notebook, you get a nice table.
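Outside a notebook you can inspect the `Fit` object programmatically. A minimal sketch,
assuming the summaries are exposed as attributes (the attribute names here are an
assumption and may differ in your version):
```python
# a sketch only: attribute names are an assumption, not a documented API
print(fit.model)             # the fitted model wrapper
print(fit.training_summary)  # ClassificationSummary of the training data
print(fit.test_summary)      # ClassificationSummary of the test data
```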

### Binary Classification with Loss
As you can see in the above example, there are two confusion matrices: the regular,
well-known one and a "loss" matrix. The intent of the loss matrix is to tell you whether a
misclassification has a cost, i.e. a loss in dollars.
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression
df = pd.fetch_yahoo(spy='SPY')
df["label"] = df["spy_Close"] > df["spy_Open"]
df["loss"] = (df["spy_Open"] / df["spy_Close"] - 1) * 100
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                                      pmu.FeaturesAndLabels(features=['spy_Open', 'spy_Low'],
                                                            labels=['label'],
                                                            loss_column='loss')),
                        test_size=0.4)
```

Now you can see the loss in % of dollars of your misclassifications. The classification
probabilities are plotted at the very top of the plot.
### Autoregressive Models and RNN Shape
It is also possible to use the `FeaturesAndLabels` object to generate autoregressive
features. By default, lagging features results in an RNN-shaped 3D array (in the format
Keras likes). However, we can also use SkitModels; in that case the features are implicitly
transformed back into a 2D array (using the `reshape_rnn_as_ar` function).
```python
import pandas_ml_utils as pmu
pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      feature_lags=range(0, 10))
```
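To illustrate the two shapes: with lagging, the features arrive as a 3D array of shape
`(samples, lags, features)`; for an ordinary autoregressive sklearn model the lag axis is
flattened into the feature axis. A plain numpy sketch of the idea (not the library's
internal code):
```python
import numpy as np

# 100 samples, 10 lags, 1 feature -> the RNN-shaped 3D array
rnn_shaped = np.zeros((100, 10, 1))

# flatten the lags into the feature axis for an ordinary sklearn model,
# conceptually what `reshape_rnn_as_ar` does
ar_shaped = rnn_shaped.reshape(rnn_shaped.shape[0], -1)
print(ar_shaped.shape)  # (100, 10)
```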
One may like to use very long lags, e.g. to catch seasonal effects. Since very long lags
are a bit fuzzy, I usually like to smooth them a bit with simple moving averages.
```python
import pandas_ml_utils as pmu
pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      target_columns=['strike'],
                      loss_column='put_loss',
                      feature_lags=[0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233],
                      lag_smoothing={
                          6: lambda df: df.SMA(3, price=df.columns[0]),
                          35: lambda df: df.SMA(5, price=df.columns[0])
                      })
```
Every lag from 6 onwards will be smoothed by a 3-period moving average, and every lag from
35 onwards by a 5-period moving average.
## Cross Validation
It is possible to apply a cross validation algorithm to the training data (after the
train/test split). If you only want cross validation, pass `test_size=0`.
Note that the current implementation just fits the model on all folds one after the
other, without any averaging of the validation loss. However, the folds can be looped over
many times, which essentially means we invented something like fold epochs. You can
therefore divide your fitter epochs by the number of fold epochs.
```python
from sklearn.model_selection import KFold
cv = KFold(n_splits = 10)
fit = df.fit_classifier(...,
                        SomeModel(epochs=100 / 10),
                        test_size=0.1,  # keep 10% completely unseen
                        cross_validation=(10, cv.split),
                        ...)
```
## Back-Testing a Model
todo ... `df.backtest_classifier(...)`
## Save, Load and Reuse a Model
To save a model you simply call the save method on the model inside of the fit.
```python
fit.model.save('/tmp/foo.model')
```
Loading is as simple as calling `load` on the `Model` object. You can immediately apply
the model to a data frame to get back the features along with the classification
(which is just another data frame).
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)
df.classify(pmu.Model.load('/tmp/foo.model')).tail()
```
NOTE: If you have a target level for your binary classifier, like all houses cheaper than
50k, then you can pass this target level to the `FeaturesAndLabels` object like so:
`FeaturesAndLabels(target_columns=['House Price'])`. This target column is simply fed
through to the classified data frame as a target column.
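A minimal sketch of how that looks in context (the feature and label names here are made
up for illustration):
```python
import pandas_ml_utils as pmu

# the 'House Price' target column is fed through unchanged,
# next to the classification result
fal = pmu.FeaturesAndLabels(features=['rooms', 'area'],
                            labels=['cheaper_than_50k'],
                            target_columns=['House Price'])
```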
### Fitting other models than classifiers
For non-classification tasks use the regressor functions the same way as the classifier
functions (a sketch follows the list below):
* `df.fit_regressor(...)`
* `df.backtest_regressor(...)`
* `df.regress(...)`
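A minimal sketch of fitting a regressor, assuming the regressor functions mirror the
classifier signatures (the dataset and feature choice here are just for illustration):
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

db = load_diabetes()
df = pd.DataFrame(db.data, columns=db.feature_names)
df["label"] = db.target

# assumption: fit_regressor accepts the same arguments as fit_classifier
fit = df.fit_regressor(pmu.SkitModel(LinearRegression(),
                                     pmu.FeaturesAndLabels(features=['bmi', 'bp'],
                                                           labels=['label'])),
                       test_size=0.4)
```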
### Other utility objects
#### LazyDataFrame
Very often I need to do a lot of feature engineering. And very often I do not want to
treat averages or other engineered columns as part of the data (frame). For this use
case I have added a `LazyDataFrame` object wrapping a regular `DataFrame`, where
some columns are always calculated on the fly.
Here is an example:
```python
import pandas_ml_utils as pmu
import pandas as pd
import talib
df = pd.fetch_yahoo(spy='SPY')
ldf = pmu.LazyDataFrame(df,
                        rolling_stddev=lambda x: talib.STDDEV(x['spy_Close'], timeperiod=30) / 100)
ldf["rolling_stddev"].tail()  # is always calculated on the fly
```
#### HashableDataFrame
The hashable data frame is not something you should use directly. It is just a hack to
allow caching of feature matrices. With heavy usage of `LazyDataFrame` and heavy lagging
of features for AR models, the training data preparation might take a long time. To
shorten this time, e.g. for hyper-parameter tuning, a cache is very helpful (but keep in
mind this is still kind of a hack).
To set the cache size (default is 1), set the following environment variable before the
import: `os.environ["CACHE_FEATUES_AND_LABELS"] = "2"`. To use the cache, simply pass the
argument to the `fit_classifier` method like so: `df.fit_classifier(..., cache_feature_matrix=True)`.
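Putting both steps together (the environment variable must be set before
`pandas_ml_utils` is imported):
```python
import os
os.environ["CACHE_FEATUES_AND_LABELS"] = "2"  # cache up to 2 feature matrices

import pandas_ml_utils as pmu  # import AFTER setting the variable

# then opt in per fit, e.g. during hyper-parameter tuning:
# fit = df.fit_classifier(..., cache_feature_matrix=True)
```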
#### MultiModel
TODO describe multi models ...
## TODO
* replace hard coded summary objects by a summary provider function
* multi model is just another implementation of model
* add keras model
* add more tests
## Wanna help?
* currently I only need binary classification
* maybe you want to add a feature for multiple classes
* for non classification problems you might want to augment the `Summary`
* write some tests
