The Potential of TensorFlow XLA

Deep Learning Acceleration Study Group
2017/9/3
The plugin mechanism introduced in TensorFlow r1.3 makes it possible to support a wide range of hardware!
@Vengineer

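About me
I wrote articles on TensorFlow XLA AOT for the August and September issues of Interface magazine (CQ Publishing).
August issue:
Breaking news on a notable technology with the potential for a dramatic performance boost
Exploring TensorFlow "XLA", Google's new feature for running AI smoothly
September issue:
A cutting-edge technology enthusiast's challenge ... Exploring the TensorFlow XLA AOT compiler for snappy AI
My first look at Google source code! Exploring the potential of a compiler for AI
Blog: Vengineerの戯言 (Vengineer's Ramblings)
http://blogs.yahoo.co.jp/verification_engineer
Twitter: @Vengineer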

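Design Solution Forum
Friday, October 13, 2017, in Shin-Yokohama
Now in its fourth year, with more than 500 attendees every year
Registration is open now
http://www.dsforum.jp/
"Deep Learning Track"
"RISC-V Track"
Five talks planned for each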

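What is TensorFlow XLA?
https://www.tensorflow.org/performance/xla/
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear
algebra that optimizes TensorFlow computations. The results are improvements
in speed, memory usage, and portability on server and mobile platforms.
Initially, most users will not see large benefits from XLA, but are welcome
to experiment by using XLA via just-in-time (JIT) compilation or
ahead-of-time (AOT) compilation. Developers targeting new hardware
accelerators are especially encouraged to try out XLA.
(The slide showed a Google Translate rendering of the original English text.)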

TensorFlow w/XLA: TensorFlow, Compiled! Expressiveness with performance
https://autodiff-workshop.github.io/slides/JeffDean.pdf
Devices supported by XLA

I also wrote about it on my blog
The Impact of TensorFlow XLA
February 20, 2017
http://blogs.yahoo.co.jp/verification_engineer/71016304.html

TensorFlow User Group ハード部 #2
https://tfug-tokyo.connpass.com/event/54426/
What is TensorFlow XLA doing inside?
April 21, 2017
https://blogs.yahoo.co.jp/verification_engineer/71103781.html

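I wrote articles on TensorFlow XLA AOT for the August and September issues of Interface magazine (CQ Publishing).
August issue:
Breaking news on a notable technology with the potential for a dramatic performance boost
Exploring TensorFlow "XLA", Google's new feature for running AI smoothly
September issue:
A cutting-edge technology enthusiast's challenge ... Exploring the TensorFlow XLA AOT compiler for snappy AI
My first look at Google source code! Exploring the potential of a compiler for AI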

I also wrote about it on my blog again
Movement in TensorFlow XLA
July 3, 2017

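Article in the September 2017 issue of Nikkei Electronics:
"Sony joins the fray: competing on development environments for embedded deep learning software"
I was interviewed by Nikkei Electronics
August 19, 2017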

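What I will cover
0) How the TensorFlow graph built from a Python expression is transformed
1) JIT (Just-In-Time) compilation
   Single machine only, with one GPU
2) AOT (Ahead-Of-Time) compilation
   CPU only
   x86-64/ARM/AARCH64/PowerPC
   See Interface magazine (CQ Publishing) for details!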

Using JIT Compilation
https://www.tensorflow.org/performance/xla/jit
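The TensorFlow/XLA JIT compiler compiles and runs parts of TensorFlow
graphs via XLA.
The benefit of this over the standard TensorFlow implementation is that XLA
can fuse multiple operators (kernel fusion) into a small number of compiled
kernels. Fusing operators can reduce memory bandwidth requirements and
improve performance compared to executing operators one at a time, as the
TensorFlow executors do.
(The slide showed a Google Translate rendering of the original English text.)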
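As a rough illustration of the kernel fusion described above, here is a pure-Python sketch. It is not XLA code: the function names are hypothetical, and lists stand in for device buffers. It compares running two element-wise operators one at a time, which materializes an intermediate buffer, with a single fused loop.

```python
# Illustrative sketch only: plain Python stand-ins for "kernels".
# All names here are hypothetical and are not TensorFlow or XLA APIs.


def mul_kernel(xs, c):
    """First operator: writes a full intermediate buffer."""
    return [x * c for x in xs]


def add_kernel(xs, c):
    """Second operator: reads the intermediate buffer back."""
    return [x + c for x in xs]


def unfused(xs):
    # Two passes over memory with a temporary list in between,
    # similar in spirit to executing operators one at a time.
    tmp = mul_kernel(xs, 2.0)
    return add_kernel(tmp, 1.0)


def fused(xs):
    # One pass and no temporary list: both operators merged in one loop.
    return [x * 2.0 + 1.0 for x in xs]


if __name__ == "__main__":
    data = [1.5, 0.5]
    print(unfused(data), fused(data))
```

Both versions compute the same values; the fused one simply avoids writing and re-reading the intermediate result, which is where the memory-bandwidth saving comes from.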

Let's check with sample code

When the device is set to gpu
def test_gpu(self):
    with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("gpu"):
            y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})

How Session.run works
python/client/session.py
SessionInterface => BaseSession => Session
def run(self, fetches, feed_dict=None,
        options=None, run_metadata=None):
  _run
    _do_run
      tf_session.TF_Run
From here on, we are in C++
        TF_Run function in c/c_api.cc
          TF_Run_Helper function in c/c_api.cc
            Session::Run (core/public/session.h)
              DirectSession::Run

DirectSession::Run in C++
DirectSession::Run (core/common_runtime/direct_session.cc)
Create the executors:
TF_RETURN_IF_ERROR(
    GetOrCreateExecutors(pool, input_tensor_names,
                         output_names, target_nodes,
                         &executors_and_keys,
                         &run_state_args));
There are multiple executors.
Each executor runs independently,
and communication between executors is asynchronous.

DirectSession::Run in C++ (continued)
DirectSession::Run (core/common_runtime/direct_session.cc)
The execution part:
for (const auto& item : executors_and_keys->items) {
  item.executor->RunAsync(args, barrier->Get());
}  // the executors run asynchronously
Wait until all executors have finished:
WaitForNotification(&run_state, &step_cancellation_manager,
                    run_options.timeout_in_ms() > 0
                        ? run_options.timeout_in_ms()
                        : operation_timeout_in_ms_);
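The "start every executor asynchronously, then wait for a notification" pattern above can be sketched in Python with threads. This is only an analogy under my own assumptions: `run_executors`, the callback, and the timeout value are hypothetical names and choices, not TensorFlow APIs.

```python
import threading

# Illustrative sketch only: a thread-based analogy of RunAsync plus a
# barrier, followed by a wait with timeout. Names are hypothetical.


def run_executors(executors, timeout_s=5.0):
    """executors: {name: zero-argument callable}. Returns {name: result}."""
    if not executors:
        return {}

    results = {}
    remaining = len(executors)
    lock = threading.Lock()
    all_done = threading.Event()  # plays the role of the notification

    def make_done_callback(name):
        def done(value):
            nonlocal remaining
            with lock:
                results[name] = value
                remaining -= 1
                if remaining == 0:
                    all_done.set()  # last executor finished
        return done

    for name, fn in executors.items():
        # Analogous to item.executor->RunAsync(args, barrier->Get())
        callback = make_done_callback(name)
        worker = threading.Thread(
            target=lambda fn=fn, callback=callback: callback(fn()))
        worker.start()

    # Analogous to WaitForNotification(..., timeout)
    if not all_done.wait(timeout=timeout_s):
        raise TimeoutError("executors did not finish in time")
    return results


if __name__ == "__main__":
    print(run_executors({"cpu": lambda: "cpu done", "gpu": lambda: "gpu done"}))
```

The caller blocks only once, after all executors have been started, so the executors themselves are free to run concurrently.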

executor->RunAsync
Executor::RunAsync (core/common_runtime/executor.h)
ExecutorImpl::RunAsync
ExecutorState::RunAsync
ExecutorState::ScheduleReady
ExecutorState::Process (core/common_runtime/executor.cc)
  ・device->Compute
This will come up again later, so remember it!

0) Initial graph
[Figure: graph with the nodes Feed(x), Const, Mul, and Fetch(y)]

1) Adding Feed/Fetch nodes
[Figure: graph with the nodes Feed(x), _Recv, Const, Mul, _Send, and Fetch(y)]

2) Placement
[Figure: the same graph (_Recv, Const, Mul, _Send) with device labels: Feed(x) and Fetch(y) on cpu, two labels reading gpu]

3) Graph partitioning
[Figure: the graph split into a cpu partition holding Feed(x) and Fetch(y) and a gpu partition holding Const and Mul, linked by _Send/_Recv nodes]
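The partitioning step can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not TensorFlow's partitioner: the graph is a plain dict, and the `_Send(...)`/`_Recv(...)` node names are made up for illustration. Every edge that crosses a device boundary is replaced by a send node on the source device and a receive node on the destination device.

```python
# Illustrative sketch only: a toy graph partitioner.
# The data layout and the generated node names are hypothetical.


def partition(nodes):
    """nodes: {name: (device, [input names])}
    Returns {device: {node name: [input names]}}."""
    parts = {}
    for _, (device, _) in nodes.items():
        parts.setdefault(device, {})

    for name, (device, inputs) in nodes.items():
        new_inputs = []
        for src in inputs:
            src_device = nodes[src][0]
            if src_device == device:
                new_inputs.append(src)  # edge stays inside one partition
                continue
            # Cross-device edge: cut it and add a send/recv pair.
            send = "_Send({}->{})".format(src, name)
            recv = "_Recv({}->{})".format(src, name)
            parts[src_device][send] = [src]
            parts[device][recv] = []
            new_inputs.append(recv)
        parts[device][name] = new_inputs
    return parts


# The y = x * 2 example from the slides, placed on cpu and gpu.
graph = {
    "Feed(x)": ("cpu", []),
    "Const": ("gpu", []),
    "Mul": ("gpu", ["Feed(x)", "Const"]),
    "Fetch(y)": ("cpu", ["Mul"]),
}

if __name__ == "__main__":
    for device, subgraph in partition(graph).items():
        print(device, subgraph)
```

Each resulting partition can then be handed to its own executor, with the send/recv pairs carrying tensors between them.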

Changing gpu to XLA_GPU
def testXLA_JIT(self):
    with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("device:XLA_GPU:0"):
            y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})

2) Placement
[Figure: the same graph (_Recv, Const, Mul, _Send) with device labels: Feed(x) and Fetch(y) on cpu, two labels reading XLA_GPU]

[Figure: the graph split into a cpu partition holding Feed(x) and Fetch(y) and an XLA_GPU partition holding Const and Mul, linked by _Send/_Recv nodes]

[Figure: the XLA_GPU partition now holds a single _XlaLaunch node between its _Recv and _Send nodes; Feed(x) and Fetch(y) stay on cpu]

Converting multiple ops into a single _XlaLaunch op
[Figure: the _XlaLaunch node on XLA_GPU, shown together with Mul and Const and the label gpu]

Wait, what?
Why does it turn into _XlaLaunch?
How come?

In TensorFlow XLA JIT:
The nodes of each subgraph that can run within the same device
are packed tightly into one node
and executed inside a _XlaLaunch op.
_XlaLaunch is implemented as an op dedicated to TensorFlow XLA.
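The idea of packing a same-device subgraph into one node can be sketched as follows. This is a toy model under my own assumptions (a dict-based graph, a single cluster per device, a single output), not the actual XLA JIT clustering pass, which also has to deal with multiple clusters, multiple outputs, and cycle avoidance.

```python
# Illustrative sketch only: collapse every node placed on one device
# into a single "_XlaLaunch" node. The data layout is hypothetical.


def cluster_for_xla(nodes, xla_device):
    """nodes: {name: (device, [input names])}
    Returns (new graph, names of the nodes moved into the launch body)."""
    clustered = [n for n, (d, _) in nodes.items() if d == xla_device]
    if not clustered:
        return dict(nodes), []
    inside = set(clustered)

    # Inputs that come from outside the cluster become launch arguments.
    args = []
    for name in clustered:
        for src in nodes[name][1]:
            if src not in inside and src not in args:
                args.append(src)

    new_graph = {}
    for name, (device, inputs) in nodes.items():
        if name in inside:
            continue
        # Consumers of clustered nodes now read from the launch node.
        rewired = ["_XlaLaunch" if s in inside else s for s in inputs]
        new_graph[name] = (device, rewired)
    new_graph["_XlaLaunch"] = (xla_device, args)
    return new_graph, sorted(clustered)


graph = {
    "Feed(x)": ("cpu", []),
    "Const": ("XLA_GPU", []),
    "Mul": ("XLA_GPU", ["Feed(x)", "Const"]),
    "Fetch(y)": ("cpu", ["Mul"]),
}

if __name__ == "__main__":
    print(cluster_for_xla(graph, "XLA_GPU"))
```

After the rewrite the outer graph sees only one node on XLA_GPU, and the original Const and Mul live on as the body of the function that the launch node compiles and runs.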

Adding a New Op
https://www.tensorflow.org/versions/master/how_tos/adding_an_op/
What you need:
　・Register the new Op in a C++ file
　・Implement the Op in C++
　・Optionally, create a Python wrapper
　・Optionally, write a function to compute gradients for the Op
　・Test the Op, typically in Python

How is the _XlaLaunch op implemented?
・Register the new Op in a C++ file
・Implement the Op in C++
compiler/jit/kernels/xla_local_launch_op.h
compiler/jit/kernels/xla_local_launch_op.cc

Registering the _XlaLaunch op
REGISTER_OP("_XlaLaunch")
    .Input("constants: Tconstants")
    .Attr("Tconstants: list(type) >= 0")
    .Input("args: Targs")
    .Attr("Targs: list(type) >= 0")
    .Output("results: Tresults")
    .Attr("Tresults: list(type) >= 0")
    .Attr("function: func")
    .Doc("XLA Launch Op. For use by the XLA JIT only.");

Implementing the _XlaLaunch op
class XlaDeviceLaunchOp : public OpKernel {
 public:
  explicit XlaDeviceLaunchOp(OpKernelConstruction* ctx);
  ~XlaDeviceLaunchOp() override;
  void Compute(OpKernelContext* ctx) override;
  // Remember? This is the device->Compute from earlier!
 private:
  ....
  TF_DISALLOW_COPY_AND_ASSIGN(XlaDeviceLaunchOp);
};

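XlaDeviceLaunchOp::Compute
・Create an instance (compiler) of the XlaCompilationCache class
・Compile the set of functions to be executed inside the _XlaLaunch op
    cache->Compile( …. );
・Convert the parameters and the input list into XLA data
・Generate the cache and execute
    cache->client()->Execute(.....);
・Convert the XLA data into the output list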
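The steps above (compile through a cache, convert inputs, execute, convert outputs) can be sketched in Python. This is a toy model under my own assumptions: `CompilationCache`, `launch`, and `toy_compile` are hypothetical names, the cache key (function name plus argument shapes) is my simplification, and tuples stand in for XLA-side data.

```python
# Illustrative sketch only: a compile-once, run-many launch routine.
# None of these names are TensorFlow or XLA APIs.


class CompilationCache:
    """Caches one compiled callable per (function name, argument shapes)."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_count = 0  # exposed so the caching effect is visible

    def compile(self, function_name, arg_shapes):
        key = (function_name, tuple(arg_shapes))
        if key not in self._cache:
            self.compile_count += 1
            self._cache[key] = self._compile_fn(function_name, arg_shapes)
        return self._cache[key]


def toy_compile(function_name, arg_shapes):
    # Stand-in for real code generation: returns a Python callable.
    if function_name != "mul2":
        raise NotImplementedError(function_name)
    return lambda xs: tuple(x * 2 for x in xs)


def launch(cache, function_name, inputs):
    # 1. Convert the input list into "XLA-side" data (tuples here).
    args = [tuple(values) for values in inputs]
    shapes = [(len(a),) for a in args]
    # 2. Compile, or fetch the already compiled executable.
    executable = cache.compile(function_name, shapes)
    # 3. Execute, then convert the result back into an output list.
    return list(executable(*args))


if __name__ == "__main__":
    cache = CompilationCache(toy_compile)
    print(launch(cache, "mul2", [[1.5, 0.5]]))
    print(launch(cache, "mul2", [[2.0, 4.0]]), cache.compile_count)
```

The second call reuses the executable compiled by the first one because the function name and argument shapes match; a different shape would trigger another compilation.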

What XlaLocalLaunchOp::Compute does
LLVM is used here
cache->Compile cache->client()->Execute

Plugin
Intel Nervana
Graphcore
may also end up supporting XLA?

Intel® Nervana™ Graph Beta
2017-06-22
TensorFlow/XLA Support
https://www.intelnervana.com/intel-nervana-graph-and-neon-3-0-updates/
Intel® Nervana™ Graph: A Universal Tensor JIT Compiler Webinar
https://software.seek.intel.com/IntelNervanaGraphWebinar_Reg
For details, see the materials for this webinar!

Graphcore
https://www.nextplatform.com/2017/05/08/dive-deep-learning-chip-startup-graphcores-software-stack/
The original code for the TensorFlow XLA plugin came from Graphcore
TensorFlow:Remove copyright on non-poplar files
https://github.com/tensorflow/tensorflow/commit/679152e2c13229db9386fe5c3a267e63d0093889

TensorFlow XLA Google Group
https://groups.google.com/forum/m/#!forum/xla-dev
Posts from
Graphcore
Intel Nervana
Knuedge
and, of course, from people inside Google

compiler/plugin/executor
・BUILD
・device.cc
・compiler.{cc, h}
・executable.{cc, h}
・executor.{cc, h}
・platform.{cc, h}
・platform_id.h
・transfer_manager.{cc, h}

What XlaDeviceLaunchOp::Compute does
compiler.cc executable.{h,cc}
executor.{h,cc}

Registering XLA_EXEC (device.cc)
const char* const DEVICE_XLA_EXEC = "XLA_EXEC";
const char* const DEVICE_EXEC_XLA_JIT = "XLA_EXEC_JIT";
constexpr std::array<DataType, 5> kExecAllTypes = {
    {DT_INT32, DT_FLOAT, DT_BOOL, DT_DOUBLE, DT_INT64}};
class XlaExaDeviceFactory : public DeviceFactory {
 public:
  Status CreateDevices(const SessionOptions& options,
                       const string& name_prefix,
                       std::vector<Device*>* devices) override;

REGISTER_LOCAL_DEVICE_FACTORY(
    DEVICE_XLA_EXEC, XlaExaDeviceFactory, 40);
constexpr std::array<DataType, 5> kAllXlaCpuTypes = {{
    DT_INT32, DT_INT64, DT_FLOAT,
    DT_DOUBLE, DT_BOOL}};
REGISTER_XLA_LAUNCH_KERNEL(
    DEVICE_XLA_EXEC, XlaDeviceLaunchOp, kExecAllTypes);
REGISTER_XLA_DEVICE_KERNELS(
    DEVICE_XLA_EXEC, kExecAllTypes);

Device registration
core/common_runtime/device_factory.{h,cc}
// The default priority values for built-in devices is:
// GPU: 210
// SYCL: 200
// GPUCompatibleCPU: 70
// ThreadPoolDevice: 60
// Default: 50
The priority is set through the REGISTER_LOCAL_DEVICE_FACTORY macro
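A registry with priorities like the one listed above can be sketched in Python. This is only a model under my own assumptions: `DeviceFactoryRegistry` and its methods are hypothetical names, and "higher number is considered first" is the only behavior modeled. The numbers reuse values that appear on the slides (GPU 210, ThreadPoolDevice 60, and 40 for XLA_EXEC).

```python
# Illustrative sketch only: device factories ordered by priority.
# Class and method names are hypothetical, not TensorFlow APIs.


class DeviceFactoryRegistry:
    DEFAULT_PRIORITY = 50

    def __init__(self):
        self._factories = []  # list of (priority, device type, factory)

    def register(self, device_type, factory, priority=DEFAULT_PRIORITY):
        self._factories.append((priority, device_type, factory))

    def create_devices(self):
        """Calls every factory, highest priority first."""
        devices = []
        ordered = sorted(self._factories, key=lambda entry: -entry[0])
        for _, device_type, factory in ordered:
            devices.extend(factory(device_type))
        return devices


def single_device_factory(device_type):
    # Hypothetical factory: creates exactly one device of the given type.
    return ["{}:0".format(device_type)]


registry = DeviceFactoryRegistry()
registry.register("XLA_EXEC", single_device_factory, 40)
registry.register("CPU", single_device_factory, 60)
registry.register("GPU", single_device_factory, 210)

if __name__ == "__main__":
    print(registry.create_devices())
```

With a priority of 40, a plugin device registered this way sorts after the built-in devices, so it is used only when explicitly requested rather than picked by default.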

REGISTER_XLA_BACKEND(
    DEVICE_EXEC_XLA_JIT, kExecAllTypes, OpFilter);
Added to tf2xla/xla_op_registry.h in r1.2:
// REGISTER_XLA_BACKEND() registers an XLA backend. Example usage:
// REGISTER_XLA_BACKEND(DEVICE_GPU_XLA_JIT, kGpuAllTypes, GpuOpFilter);
#define REGISTER_XLA_BACKEND(NAME, ...) \
  REGISTER_XLA_BACKEND_UNIQ_HELPER(__COUNTER__, NAME, __VA_ARGS__)
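The three arguments of the registration (backend name, supported data types, op filter) suggest a registry along the following lines. This Python sketch is my own simplified model: the class, its methods, and the example filter are hypothetical, and the real registry works on kernel definitions rather than bare op names.

```python
# Illustrative sketch only: a backend registry keyed by name.
# All names and the example filter are hypothetical.


class XlaBackendRegistry:
    def __init__(self):
        self._backends = {}

    def register_backend(self, name, supported_types, op_filter):
        if name in self._backends:
            raise ValueError("backend already registered: " + name)
        self._backends[name] = (frozenset(supported_types), op_filter)

    def is_supported(self, name, op, dtype):
        """True if the backend accepts this op with this data type."""
        supported_types, op_filter = self._backends[name]
        return dtype in supported_types and op_filter(op)


EXEC_ALL_TYPES = ["DT_INT32", "DT_FLOAT", "DT_BOOL", "DT_DOUBLE", "DT_INT64"]


def example_op_filter(op):
    # Hypothetical policy: reject one made-up op, accept everything else.
    return op != "HypotheticalUnsupportedOp"


backends = XlaBackendRegistry()
backends.register_backend("XLA_EXEC_JIT", EXEC_ALL_TYPES, example_op_filter)

if __name__ == "__main__":
    print(backends.is_supported("XLA_EXEC_JIT", "Mul", "DT_FLOAT"))
```

The point of the filter is that a new backend does not have to accept every op from day one; it can start with a subset and widen it as support is implemented.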

Compile
plugin/executor/compiler.{h,cc}
RunHloOptimization: optimizes the HLO
// Typically you would visit the HLO graph, building up a compiled equivalent
// In this case we are using an Hlo evaluator at execution time, so we don't
// need to compile anything
// This is where code generation for the plugin would go
Creates the ExecutorExecutable
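The comment quoted above says the example plugin does not generate code and instead evaluates the HLO at execution time. That design can be sketched in Python as an "executable" that just keeps the graph and interprets it on each call. The instruction format, opcode names, and class names below are my own assumptions for illustration, not XLA's HLO data structures.

```python
# Illustrative sketch only: an interpreter-style "executable".
# The instruction encoding and all names are hypothetical.


class InterpretedExecutable:
    """Keeps the graph and evaluates it on every execute() call."""

    def __init__(self, instructions, root):
        self._instructions = instructions  # {name: (opcode, operands)}
        self._root = root

    def execute(self, arguments):
        values = {}

        def evaluate(name):
            if name in values:
                return values[name]
            opcode, operands = self._instructions[name]
            if opcode == "parameter":
                result = arguments[operands[0]]
            elif opcode == "constant":
                result = operands[0]
            elif opcode == "multiply":
                lhs, rhs = (evaluate(o) for o in operands)
                result = [a * b for a, b in zip(lhs, rhs)]
            else:
                raise NotImplementedError(opcode)
            values[name] = result
            return result

        return evaluate(self._root)


def compile_module(instructions, root):
    # A backend for real hardware would generate device code here.
    # This sketch compiles nothing and defers all work to execute().
    return InterpretedExecutable(instructions, root)


# y = x * 2, with the constant already broadcast to the shape of x.
module = {
    "x": ("parameter", [0]),
    "two": ("constant", [[2.0, 2.0]]),
    "y": ("multiply", ["x", "two"]),
}

if __name__ == "__main__":
    executable = compile_module(module, "y")
    print(executable.execute([[1.5, 0.5]]))
```

A plugin for actual hardware would replace the body of `compile_module` with its own code generation and return an executable that runs on the device.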

Source: https://raw.githubusercontent.com/aws/aws-fpga/master/hdk/docs/images/AWS_FPGA_Software_Overview.jpg
Could this be done on AWS EC2 F1 too?

https://www.nextplatform.com/2017/08/23/first-depth-view-wave-computings-dpu-architecture-systems/
Wave Computing

http://tvmlang.org/2017/08/17/tvm-release-announcement.html
MXNet-NNVM-TVM
LLVM is for CPU; CUDA is handled separately

https://www.nextplatform.com/2017/08/24/drilling-microsofts-brainwave-soft-deep-leaning-chip/
Microsoft BrainWave
For inference, with a batch size of 1

The Potential of TensorFlow XLA
As explained above,
the plugin introduced in TensorFlow r1.3
makes it possible to support a wide range of hardware,
something other frameworks cannot do!
That is what I focused on!

Published on SlideShare
https://www.slideshare.net/ssuser479fa3
TensorFlow XLA: JIT edition (r1.3 version)
https://www.slideshare.net/ssuser479fa3/tensroflow-xla-jit
What is Intel Nervana Graph?
https://www.slideshare.net/ssuser479fa3/intel-nervana-graph-compiler
Deep learning on DSPs
https://www.slideshare.net/ssuser479fa3/dsp-75659146

Thank you
Blog: Vengineerの戯言 (Vengineer's Ramblings)
http://blogs.yahoo.co.jp/verification_engineer
Twitter: @Vengineer
FPGA Extreme Computing
9th meeting
September 24, 2017
