{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FEL3NlTTDlSX" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2023-11-07T19:58:53.321709Z", "iopub.status.busy": "2023-11-07T19:58:53.320985Z", "iopub.status.idle": "2023-11-07T19:58:53.325284Z", "shell.execute_reply": "2023-11-07T19:58:53.324628Z" }, "id": "FlUw7tSKbtg4" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "77z2OchJTk0l" }, "source": [ "# Debug a TensorFlow 2 migrated training pipeline\n", "\n", "
View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "zTwPu-w6M5sz" }, "source": [ "This notebook demonstrates how to debug a training pipeline when migrating to TensorFlow 2 (TF2). It consists of the following components:\n", "\n", "1. Suggested steps and code samples for debugging the training pipeline\n", "2. Tools for debugging\n", "3. Other related resources\n", "\n", "One assumption is that you have the TensorFlow 1 (TF1.x) code and trained models for comparison, and you want to build a TF2 model that achieves similar validation accuracy.\n", "\n", "This notebook does **NOT** cover debugging performance issues for training/inference speed or memory usage." ] }, { "cell_type": "markdown", "metadata": { "id": "fKm9R4CtOAP3" }, "source": [ "## Debugging workflow\n", "\n", "Below is a general workflow for debugging your TF2 training pipelines. Note that you do not need to follow these steps in order. You can also use a binary search approach, testing the model at an intermediate step to narrow down the debugging scope.\n", "\n", "1. Fix compile and runtime errors\n", "\n", "2. Single forward pass validation (in a separate [guide](./validate_correctness.ipynb))\n", "\n", "    a. On a single CPU device\n", "\n", "    - Verify that variables are created only once\n", "    - Check that variable counts, names, and shapes match\n", "    - Reset all variables and check numerical equivalence with all randomness disabled\n", "    - Align random number generation and check numerical equivalence in inference\n", "    - (Optional) Check that checkpoints are loaded properly and that the TF1.x/TF2 models generate identical output\n", "\n", "    b. On a single GPU/TPU device\n", "\n", "    c. With multi-device strategies\n", "\n", "3. Model training numerical equivalence validation for a few steps (code samples are available below)\n", "\n", "    a. Single training step validation using small and fixed data on a single CPU device. Specifically, check numerical equivalence for the following components:\n", "\n", "    - loss calculation\n", "    - metrics\n", "    - learning rate\n", "    - gradient calculation and update\n", "\n", "    b. Check statistics after training 3 or more steps to verify optimizer behaviors (such as momentum), still with fixed data on a single CPU device\n", "\n", "    c. On a single GPU/TPU device\n", "\n", "    d. With multi-device strategies (check the intro for [MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) at the bottom)\n", "\n", "4. End-to-end convergence testing on a real dataset\n", "\n", "    a. Check training behaviors with TensorBoard\n", "\n", "    - Use simple optimizers (e.g. SGD) and simple distribution strategies (e.g. `tf.distribute.OneDeviceStrategy`) first\n", "    - Training metrics\n", "    - Evaluation metrics\n", "    - Figure out what a reasonable tolerance for inherent randomness is\n", "\n", "    b. Check equivalence with advanced optimizers/learning rate schedulers/distribution strategies\n", "\n", "    c. Check equivalence when using mixed precision\n", "\n", "5. Additional product benchmarks" ] }, { "cell_type": "markdown", "metadata": { "id": "XKakQBI9-FLb" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:58:53.329328Z", "iopub.status.busy": "2023-11-07T19:58:53.328931Z", "iopub.status.idle": "2023-11-07T19:59:21.924707Z", "shell.execute_reply": "2023-11-07T19:59:21.923786Z" }, "id": "i1ghHyXl-Oqd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n", "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "# The `DeterministicRandomTestTool` is only available from TensorFlow 2.8:\n", "!pip install -q \"tensorflow==2.9.*\"" ] }, { "cell_type": "markdown", "metadata": { "id": "usyRSlIRl3r2" }, "source": [ "### Single forward pass validation\n", "\n", "Single forward pass validation, including checkpoint loading, is covered in a different [colab](./validate_correctness.ipynb)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:21.929277Z", "iopub.status.busy": "2023-11-07T19:59:21.928943Z", "iopub.status.idle": "2023-11-07T19:59:24.192347Z", "shell.execute_reply": "2023-11-07T19:59:24.191496Z" }, "id": "HVBQbsZeVL_V" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-11-07 19:59:22.275296: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n" ] } ], "source": [ "import sys\n", "import unittest\n", "import numpy as np\n", "\n", "import tensorflow as tf\n", "import tensorflow.compat.v1 as v1" ] }, { "cell_type": "markdown", "metadata": { "id": "4M104dt7m5cC" }, "source": [ "### Model training numerical equivalence validation for a few steps" ] }, { "cell_type": "markdown", "metadata": { "id": "v2Nz2Ni1EkMz" }, "source": [ "Set up the model configuration and prepare a fake dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.197049Z", "iopub.status.busy": "2023-11-07T19:59:24.196574Z", "iopub.status.idle": "2023-11-07T19:59:24.202189Z", "shell.execute_reply": "2023-11-07T19:59:24.201524Z" }, "id": "hUxXadzKU9rT" }, "outputs": [], "source": [ "params = {\n", "    'input_size': 3,\n", "    'num_classes': 3,\n", "    'layer_1_size': 2,\n", "    'layer_2_size': 2,\n", "    'num_train_steps': 100,\n", "    'init_lr': 1e-3,\n", "    'end_lr': 0.0,\n", "    'decay_steps': 1000,\n", "    'lr_power': 1.0,\n", "}\n", "\n", "# make a small fixed dataset\n", "fake_x = np.ones((2, params['input_size']), dtype=np.float32)\n", "fake_y = np.zeros((2, params['num_classes']), dtype=np.int32)\n", "fake_y[0][0] = 1\n", "fake_y[1][1] = 1\n", "\n", "step_num = 3" ] }, { "cell_type": "markdown", "metadata": { "id": "lV_n3Ukmz4Un" }, "source": [ "Define the TF1.x model." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.205710Z", "iopub.status.busy": "2023-11-07T19:59:24.205460Z", "iopub.status.idle": "2023-11-07T19:59:24.216387Z", "shell.execute_reply": "2023-11-07T19:59:24.215688Z" }, "id": "ATa5fzL8mAwl" }, "outputs": [], "source": [ "# Assume there is an existing TF1.x model using the estimator API\n", "# Wrap the model_fn to log necessary tensors for result comparison\n", "class SimpleModelWrapper():\n", "  def __init__(self):\n", "    self.logged_ops = {}\n", "    self.logs = {\n", "        'step': [],\n", "        'lr': [],\n", "        'loss': [],\n", "        'grads_and_vars': [],\n", "        'layer_out': []}\n", "\n", "  def model_fn(self, features, labels, mode, params):\n", "    out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", "    out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", "    logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n", "    loss = 
tf.compat.v1.losses.softmax_cross_entropy(labels, logits)\n", "\n", "    # skip EstimatorSpec details for prediction and evaluation\n", "    if mode == tf.estimator.ModeKeys.PREDICT:\n", "      pass\n", "    if mode == tf.estimator.ModeKeys.EVAL:\n", "      pass\n", "    assert mode == tf.estimator.ModeKeys.TRAIN\n", "\n", "    global_step = tf.compat.v1.train.get_or_create_global_step()\n", "    lr = tf.compat.v1.train.polynomial_decay(\n", "        learning_rate=params['init_lr'],\n", "        global_step=global_step,\n", "        decay_steps=params['decay_steps'],\n", "        end_learning_rate=params['end_lr'],\n", "        power=params['lr_power'])\n", "\n", "    optimizer = tf.compat.v1.train.GradientDescentOptimizer(lr)\n", "    grads_and_vars = optimizer.compute_gradients(\n", "        loss=loss,\n", "        # collect trainable variables from the graph currently being built\n", "        var_list=tf.compat.v1.get_default_graph().get_collection(\n", "            tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES))\n", "    train_op = optimizer.apply_gradients(\n", "        grads_and_vars,\n", "        global_step=global_step)\n", "\n", "    # log tensors\n", "    self.logged_ops['step'] = global_step\n", "    self.logged_ops['lr'] = lr\n", "    self.logged_ops['loss'] = loss\n", "    self.logged_ops['grads_and_vars'] = grads_and_vars\n", "    self.logged_ops['layer_out'] = {\n", "        'layer_1': out_1,\n", "        'layer_2': out_2,\n", "        'logits': logits}\n", "\n", "    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n", "\n", "  def update_logs(self, logs):\n", "    for key in logs.keys():\n", "      self.logs[key].append(logs[key])" ] }, { "cell_type": "markdown", "metadata": { "id": "kki9yILSKS7f" }, "source": [ "The [`v1.keras.utils.DeterministicRandomTestTool`](https://tensorflow.google.cn/api_docs/python/tf/compat/v1/keras/utils/DeterministicRandomTestTool) class below provides a context manager `scope()` that can make stateful random operations use the same seed across both TF1 graphs/sessions and eager execution.\n", "\n", "The tool provides two testing modes:\n", "\n", "1. `constant`, which uses the same seed for every single operation no matter how many times it has been called, and\n", "2. `num_random_ops`, which uses the number of previously-observed stateful random operations as the operation seed.\n", "\n", "This applies both to the stateful random operations used for creating and initializing variables, and to the stateful random operations used in computation (for example, for dropout layers)." ] }, 
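{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a small illustrative sketch (an addition to this guide, not part of the TF1.x/TF2 comparison itself) of how the two modes behave for plain `tf.random.uniform` calls in eager execution; `demo_tool` and the variable names are arbitrary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (added for illustration, not part of the original flow).\n", "# It assumes eager execution and uses `tf.random.uniform` as an example of a\n", "# stateful random op.\n", "\n", "# In `constant` mode every stateful random op reuses the same seed, so the two\n", "# draws below are expected to be identical.\n", "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='constant')\n", "with demo_tool.scope():\n", "  a = tf.random.uniform(shape=(3,))\n", "  b = tf.random.uniform(shape=(3,))\n", "np.testing.assert_allclose(a.numpy(), b.numpy())\n", "\n", "# In `num_random_ops` mode the seed follows the number of stateful random ops\n", "# observed so far, so `c` and `d` generally differ from each other, but the\n", "# whole sequence is reproducible as long as the program order is preserved.\n", "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "with demo_tool.scope():\n", "  c = tf.random.uniform(shape=(3,))\n", "  d = tf.random.uniform(shape=(3,))" ] }, 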
{ "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.219711Z", "iopub.status.busy": "2023-11-07T19:59:24.219460Z", "iopub.status.idle": "2023-11-07T19:59:24.532035Z", "shell.execute_reply": "2023-11-07T19:59:24.531336Z" }, "id": "X6Y3RWMoKOl8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_234052/2689227634.py:1: The name tf.keras.utils.DeterministicRandomTestTool is deprecated. Please use tf.compat.v1.keras.utils.DeterministicRandomTestTool instead.\n", "\n" ] } ], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')" ] }, { "cell_type": "markdown", "metadata": { "id": "mk5-ZzxcErX5" }, "source": [ "Run the TF1.x model in graph mode. Collect statistics for the first 3 training steps for numerical equivalence comparison." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.536063Z", "iopub.status.busy": "2023-11-07T19:59:24.535761Z", "iopub.status.idle": "2023-11-07T19:59:25.438355Z", "shell.execute_reply": "2023-11-07T19:59:25.437228Z" }, "id": "r5zhJHvsWA24" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-11-07 19:59:25.069458: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069585: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069663: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069737: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.149993: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.150247: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", "Skipping registering GPU devices...\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:14: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", "  out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:15: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", "  out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:16: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. 
Please use `tf.keras.layers.Dense` instead.\n", "  logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n" ] } ], "source": [ "with random_tool.scope():\n", "  graph = tf.Graph()\n", "  with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", "    model_tf1 = SimpleModelWrapper()\n", "    # build the model\n", "    inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", "    labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", "    spec = model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", "    train_op = spec.train_op\n", "\n", "    sess.run(tf.compat.v1.global_variables_initializer())\n", "    for step in range(step_num):\n", "      # log everything and update the model for one step\n", "      logs, _ = sess.run(\n", "          [model_tf1.logged_ops, train_op],\n", "          feed_dict={inputs: fake_x, labels: fake_y})\n", "      model_tf1.update_logs(logs)" ] }, { "cell_type": "markdown", "metadata": { "id": "eZxjI8Nxz9Ea" }, "source": [ "Define the TF2 model." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.442887Z", "iopub.status.busy": "2023-11-07T19:59:25.442589Z", "iopub.status.idle": "2023-11-07T19:59:25.453913Z", "shell.execute_reply": "2023-11-07T19:59:25.453186Z" }, "id": "AA67rh2TkS1M" }, "outputs": [], "source": [ "class SimpleModel(tf.keras.Model):\n", "  def __init__(self, params, *args, **kwargs):\n", "    super(SimpleModel, self).__init__(*args, **kwargs)\n", "    # define the model\n", "    self.dense_1 = tf.keras.layers.Dense(params['layer_1_size'])\n", "    self.dense_2 = tf.keras.layers.Dense(params['layer_2_size'])\n", "    self.out = tf.keras.layers.Dense(params['num_classes'])\n", "    learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(\n", "        initial_learning_rate=params['init_lr'],\n", "        decay_steps=params['decay_steps'],\n", "        end_learning_rate=params['end_lr'],\n", "        power=params['lr_power'])\n", "    self.optimizer = tf.keras.optimizers.legacy.SGD(learning_rate_fn)\n", "    self.compiled_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)\n", "    self.logs = {\n", "        'lr': [],\n", "        'loss': [],\n", "        'grads': [],\n", "        'weights': [],\n", "        'layer_out': []}\n", "\n", "  def call(self, inputs):\n", "    out_1 = self.dense_1(inputs)\n", "    out_2 = self.dense_2(out_1)\n", "    logits = self.out(out_2)\n", "    # log output features for every layer for comparison\n", "    layer_wise_out = {\n", "        'layer_1': out_1,\n", "        'layer_2': out_2,\n", "        'logits': logits}\n", "    self.logs['layer_out'].append(layer_wise_out)\n", "    return logits\n", "\n", "  def train_step(self, data):\n", "    x, y = data\n", "    with tf.GradientTape() as tape:\n", "      logits = self(x)\n", "      loss = self.compiled_loss(y, logits)\n", "    grads = tape.gradient(loss, self.trainable_weights)\n", "    # log training statistics\n", "    step = self.optimizer.iterations.numpy()\n", "    self.logs['lr'].append(self.optimizer.learning_rate(step).numpy())\n", "    self.logs['loss'].append(loss.numpy())\n", "    self.logs['grads'].append(grads)\n", "    self.logs['weights'].append(self.trainable_weights)\n", "    # update model\n", "    self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n", "    return" ] }, { "cell_type": "markdown", "metadata": { "id": "I5smAcaEE8nX" }, "source": [ "Run the TF2 model in eager mode. Collect statistics for the first 3 training steps for numerical equivalence comparison." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.457910Z", "iopub.status.busy": "2023-11-07T19:59:25.457419Z", "iopub.status.idle": "2023-11-07T19:59:25.516581Z", "shell.execute_reply": "2023-11-07T19:59:25.515677Z" }, "id": "Q0AbXF_eE8cS" }, "outputs": [], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "with random_tool.scope():\n", "  model_tf2 = SimpleModel(params)\n", "  for step in range(step_num):\n", "    model_tf2.train_step([fake_x, fake_y])" ] }, { "cell_type": "markdown", "metadata": { "id": "cjJDjLcAz_gU" }, "source": [ "Compare numerical equivalence for the first few training steps.\n", "\n", "You can also check the [validating correctness & numerical equivalence notebook](./validate_correctness.ipynb) for additional advice on numerical equivalence." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.521396Z", "iopub.status.busy": "2023-11-07T19:59:25.520967Z", "iopub.status.idle": "2023-11-07T19:59:25.561272Z", "shell.execute_reply": "2023-11-07T19:59:25.560102Z" }, "id": "6CbCUbsCiabC" }, "outputs": [], "source": [ "np.testing.assert_allclose(model_tf1.logs['lr'], model_tf2.logs['lr'])\n", "np.testing.assert_allclose(model_tf1.logs['loss'], model_tf2.logs['loss'])\n", "for step in range(step_num):\n", "  for name in model_tf1.logs['layer_out'][step]:\n", "    np.testing.assert_allclose(\n", "        model_tf1.logs['layer_out'][step][name],\n", "        model_tf2.logs['layer_out'][step][name])" ] }, { "cell_type": "markdown", "metadata": { "id": "dhVuuciimLIY" }, "source": [ "#### Unit tests" ] }, { "cell_type": "markdown", "metadata": { "id": "sXZYFC6Hhqeb" }, "source": [ "There are a few types of unit testing that can help debug your migration code.\n", "\n", "1. Single forward pass validation\n", "2. Model training numerical equivalence validation for a few steps\n", "3. Benchmark inference performance\n", "4. The trained model makes correct predictions on fixed and simple data points\n", "\n", "You can use `@parameterized.parameters` to test models with different configurations. [Details with a code sample](https://github.com/abseil/abseil-py/blob/master/absl/testing/parameterized.py).\n", "\n", "Note that it's possible to run session APIs and eager execution in the same test case. The code snippets below show how." ] }, 
{ "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.566564Z", "iopub.status.busy": "2023-11-07T19:59:25.565894Z", "iopub.status.idle": "2023-11-07T19:59:25.576572Z", "shell.execute_reply": "2023-11-07T19:59:25.575776Z" }, "id": "CdHqkgPPM2Bj" }, "outputs": [], "source": [ "import unittest\n", "\n", "class TestNumericalEquivalence(unittest.TestCase):\n", "\n", "  # copied from code samples above\n", "  # (`setUp` is run automatically by unittest before each test method)\n", "  def setUp(self):\n", "    # record statistics for 100 training steps\n", "    step_num = 100\n", "\n", "    # setup TF 1 model\n", "    random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "    with random_tool.scope():\n", "      # run TF1.x code in graph mode with context management\n", "      graph = tf.Graph()\n", "      with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", "        self.model_tf1 = SimpleModelWrapper()\n", "        # build the model\n", "        inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", "        labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", "        spec = self.model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", "        train_op = spec.train_op\n", "\n", "        sess.run(tf.compat.v1.global_variables_initializer())\n", "        for step in range(step_num):\n", "          # log everything and update the model for one step\n", "          logs, _ = sess.run(\n", "              [self.model_tf1.logged_ops, train_op],\n", "              feed_dict={inputs: fake_x, labels: fake_y})\n", "          self.model_tf1.update_logs(logs)\n", "\n", "    # setup TF2 model\n", "    random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "    with random_tool.scope():\n", "      self.model_tf2 = SimpleModel(params)\n", "      for step in range(step_num):\n", "        self.model_tf2.train_step([fake_x, fake_y])\n", "\n", "  def test_learning_rate(self):\n", "    np.testing.assert_allclose(\n", "        self.model_tf1.logs['lr'],\n", "        self.model_tf2.logs['lr'])\n", "\n", "  def test_training_loss(self):\n", "    # adopt different tolerance strategies before and after 10 steps\n", "    first_n_step = 10\n", "\n", "    # absolute difference is limited below 1e-5\n", "    # set `equal_nan` to False to detect potential NaN loss issues\n", "    absolute_tolerance = 1e-5\n", "    np.testing.assert_allclose(\n", "        actual=self.model_tf1.logs['loss'][:first_n_step],\n", "        desired=self.model_tf2.logs['loss'][:first_n_step],\n", "        atol=absolute_tolerance,\n", "        equal_nan=False)\n", "\n", "    # relative difference is limited below 5%\n", "    relative_tolerance = 0.05\n", "    np.testing.assert_allclose(self.model_tf1.logs['loss'][first_n_step:],\n", "                               self.model_tf2.logs['loss'][first_n_step:],\n", "                               rtol=relative_tolerance,\n", "                               equal_nan=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "gshSQdKIddpZ" }, "source": [ "## Debugging tools" ] }, { "cell_type": "markdown", "metadata": { "id": "CkMfCaJRclKv" }, "source": [ "### tf.print\n", "\n", "tf.print vs print/logging.info\n", "\n", "- With configurable arguments, `tf.print` can recursively display the first and last few elements of each dimension for printed tensors. Check the [API docs](https://tensorflow.google.cn/api_docs/python/tf/print) for details.\n", "- For eager execution, both `print` and `tf.print` print the value of the tensor. But `print` may involve a device-to-host copy, which can potentially slow down your code.\n", "- For graph mode, including usage inside `tf.function`, you need to use `tf.print` to print the actual tensor value. `tf.print` is compiled into an op in the graph, whereas `print` and `logging.info` only log at tracing time, which is often not what you want.\n", "- `tf.print` also supports printing composite tensors like `tf.RaggedTensor` and `tf.sparse.SparseTensor`.\n", "- You can also use callbacks to monitor metrics and variables. Check how to use custom callbacks with the [logs dict](https://tensorflow.google.cn/guide/keras/custom_callback#usage_of_logs_dict) and the [self.model attribute](https://tensorflow.google.cn/guide/keras/custom_callback#usage_of_selfmodel_attribute); a minimal callback sketch follows the `tf.print` example below." ] }, { "cell_type": "markdown", "metadata": { "id": "S-5h3cX8Dc50" }, "source": [ "tf.print vs print inside tf.function" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.580441Z", "iopub.status.busy": "2023-11-07T19:59:25.579896Z", "iopub.status.idle": "2023-11-07T19:59:25.650048Z", "shell.execute_reply": "2023-11-07T19:59:25.649208Z" }, "id": "gRED9FMyDKih" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "[2]\n" ] } ], "source": [ "# `print` prints info about the tensor object\n", "# `tf.print` prints the tensor value\n", "@tf.function\n", "def dummy_func(num):\n", "  num += 1\n", "  print(num)\n", "  tf.print(num)\n", "  return num\n", "\n", "_ = dummy_func(tf.constant([1.0]))\n", "\n", "# Output:\n", "# Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "# [2]" ] }, 
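{ "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned in the `tf.print` notes above, custom callbacks are another way to monitor metrics and variables during `Model.fit`. The next cell is an illustrative sketch (the `DebugCallback` name is hypothetical): it prints the logs dict and the norm of the first trainable weight at the end of each epoch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of monitoring training with a custom callback;\n", "# `DebugCallback` is a hypothetical name used only for illustration.\n", "class DebugCallback(tf.keras.callbacks.Callback):\n", "  def on_epoch_end(self, epoch, logs=None):\n", "    # `logs` carries the training/validation metrics for this epoch.\n", "    tf.print('epoch', epoch, 'logs:', logs)\n", "    # `self.model` gives access to the model being trained.\n", "    first_kernel = self.model.trainable_weights[0]\n", "    tf.print('first kernel norm:', tf.norm(first_kernel))\n", "\n", "# Example usage with any compiled Keras model:\n", "# model.fit(x, y, epochs=3, callbacks=[DebugCallback()])" ] }, 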
{ "cell_type": "markdown", "metadata": { "id": "3QroLA_zDK2w" }, "source": [ "tf.distribute.Strategy\n", "\n", "- If the `tf.function` containing `tf.print` is executed on the workers, for example when using `TPUStrategy` or `ParameterServerStrategy`, you need to check the worker/parameter server logs to find the printed values.\n", "- For `print` or `logging.info`, logs will be printed on the coordinator when using `ParameterServerStrategy`, and logs will be printed on the STDOUT of worker0 when using TPUs.\n", "\n", "tf.keras.Model\n", "\n", "- When using Sequential and Functional API models, if you want to print values, e.g., model inputs or intermediate features after some layers, you have the following options.\n", "  1. [Write a custom layer](https://tensorflow.google.cn/guide/keras/custom_layers_and_models) that `tf.print`s the inputs.\n", "  2. Include the intermediate outputs you want to inspect in the model outputs.\n", "- `tf.keras.layers.Lambda` layers have (de)serialization limitations. To avoid checkpoint loading issues, write a custom subclassed layer instead. Check the [API docs](https://tensorflow.google.cn/api_docs/python/tf/keras/layers/Lambda) for more details.\n", "- You can't `tf.print` intermediate outputs in a `tf.keras.callbacks.LambdaCallback` if you don't have access to the actual values, but only to the symbolic Keras tensor objects.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aKazGTr1ZUMG" }, "source": [ "Option 1: write a custom layer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.654207Z", "iopub.status.busy": "2023-11-07T19:59:25.653928Z", "iopub.status.idle": "2023-11-07T19:59:26.102988Z", "shell.execute_reply": "2023-11-07T19:59:26.102296Z" }, "id": "8w4aY7wO0B4W" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.327884018]\n", " [-0.109294683]\n", " [-0.218589365]]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/1 [==============================] - ETA: 0s - loss: 0.6077" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "1/1 [==============================] - 0s 319ms/step - loss: 0.6077\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class PrintLayer(tf.keras.layers.Layer):\n", "  def call(self, inputs):\n", "    tf.print(inputs)\n", "    return inputs\n", "\n", "def get_model():\n", "  inputs = tf.keras.layers.Input(shape=(1,))\n", "  out_1 = tf.keras.layers.Dense(4)(inputs)\n", "  out_2 = tf.keras.layers.Dense(1)(out_1)\n", "  # use the custom layer to tf.print intermediate features\n", "  out_3 = PrintLayer()(out_2)\n", "  model = tf.keras.Model(inputs=inputs, outputs=out_3)\n", "  return model\n", "\n", "model = get_model()\n", "model.compile(optimizer=\"adam\", loss=\"mse\")\n", "model.fit([1, 2, 3], [0.0, 0.0, 1.0])" ] }, { "cell_type": "markdown", "metadata": { "id": "KNESOatq7iM9" }, "source": [ "Option 2: include the intermediate outputs you want to inspect in the model outputs.\n", "\n", "Note that in such a case, you may need some [customizations](https://tensorflow.google.cn/guide/keras/customizing_what_happens_in_fit) to use `Model.fit`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:26.108865Z", "iopub.status.busy": "2023-11-07T19:59:26.108569Z", "iopub.status.idle": "2023-11-07T19:59:26.113371Z", "shell.execute_reply": "2023-11-07T19:59:26.112678Z" }, "id": "MiifvdLk7g9J" }, "outputs": [], "source": [ "def get_model():\n", "  inputs = tf.keras.layers.Input(shape=(1,))\n", "  out_1 = tf.keras.layers.Dense(4)(inputs)\n", "  out_2 = tf.keras.layers.Dense(1)(out_1)\n", "  # include intermediate values in the model outputs\n", "  model = tf.keras.Model(\n", "      inputs=inputs,\n", "      outputs={\n", "          'inputs': inputs,\n", "          'out_1': out_1,\n", "          'out_2': out_2})\n", "  return model" ] }, { "cell_type": "markdown", "metadata": { "id": "MvIKDZpHSLmQ" }, "source": [ "### pdb\n", "\n", "You can use [pdb](https://docs.python.org/3/library/pdb.html) both in the terminal and in Colab to inspect intermediate values for debugging.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Qu0n4O2umyT7" }, "source": [ "### Visualize the graph with TensorBoard\n", "\n", "You can [examine the TensorFlow graph with TensorBoard](https://tensorflow.google.cn/tensorboard/graphs). TensorBoard is also [supported on Colab](https://tensorflow.google.cn/tensorboard/tensorboard_in_notebooks). TensorBoard is a great tool for visualizing summaries. You can use it to compare the learning rate, model weights, gradient scale, training/validation metrics, or even intermediate model outputs between the TF1.x model and the migrated TF2 model through the training process, and check whether the values look as expected." ] }, { "cell_type": "markdown", "metadata": { "id": "vBnxB6_xzlnT" }, "source": [ 
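"As an illustrative sketch of getting the statistics collected earlier into TensorBoard, the next cell writes the per-step losses of both models with two summary writers; the `logs/debugging` directory layout is arbitrary. After loading the TensorBoard notebook extension, you could then compare the curves side by side with `%tensorboard --logdir logs/debugging`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch: write the statistics collected above as TensorBoard scalars.\n", "# The log directory names are arbitrary.\n", "tf1_writer = tf.summary.create_file_writer('logs/debugging/tf1')\n", "tf2_writer = tf.summary.create_file_writer('logs/debugging/tf2')\n", "\n", "with tf1_writer.as_default():\n", "  for step, loss in enumerate(model_tf1.logs['loss']):\n", "    tf.summary.scalar('loss', loss, step=step)\n", "\n", "with tf2_writer.as_default():\n", "  for step, loss in enumerate(model_tf2.logs['loss']):\n", "    tf.summary.scalar('loss', loss, step=step)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 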
"### TensorFlow Profiler\n", "\n", "[TensorFlow Profiler](https://tensorflow.google.cn/guide/profiler) 可以帮助您呈现 GPU/TPU 上的执行时间线。可以查看此 [Colab 演示](https://tensorflow.google.cn/tensorboard/tensorboard_profiling_keras)来了解其基本用法。" ] }, { "cell_type": "markdown", "metadata": { "id": "9wNmCSHBpiGM" }, "source": [ "### MultiProcessRunner\n", "\n", "在使用 MultiWorkerMirroredStrategy 和 ParameterServerStrategy 进行调试时,[MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) 是一个实用的工具。可以查看[此具体示例](https://github.com/keras-team/keras/blob/master/keras/integration_test/mwms_multi_process_runner_test.py)来了解它的用法。\n", "\n", "特别是对于这两种策略的情况,建议您 1) 不仅要使用单元测试来覆盖它们的流,2) 还要尝试在单元测试中使用它来重现失败,以避免每次尝试修复时都启动真正的分布式作业。" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "migration_debugging.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }