{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FEL3NlTTDlSX" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2023-11-07T19:58:53.321709Z", "iopub.status.busy": "2023-11-07T19:58:53.320985Z", "iopub.status.idle": "2023-11-07T19:58:53.325284Z", "shell.execute_reply": "2023-11-07T19:58:53.324628Z" }, "id": "FlUw7tSKbtg4" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "77z2OchJTk0l" }, "source": [ "# Debug a TensorFlow 2 migrated training pipeline\n", "\n", "
View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "zTwPu-w6M5sz" }, "source": [ "This notebook demonstrates how to debug a training pipeline when migrating to TensorFlow 2 (TF2). It consists of the following components:\n", "\n", "1. Suggested steps and code samples for debugging the training pipeline\n", "2. Tools for debugging\n", "3. Other related resources\n", "\n", "One assumption is that you have the TensorFlow 1 (TF1.x) code and trained models for comparison, and you want to build a TF2 model that achieves similar validation accuracy.\n", "\n", "This notebook does **NOT** cover debugging performance issues for training/inference speed or memory usage." ] }, { "cell_type": "markdown", "metadata": { "id": "fKm9R4CtOAP3" }, "source": [ "## Debugging workflow\n", "\n", "Below is a general workflow for debugging your TF2 training pipelines. Note that you do not need to follow these steps in order. You can also use a binary search approach, testing the model at an intermediate step to narrow down the debugging scope.\n", "\n", "1. Fix compile and runtime errors\n", "\n", "2. Single forward pass validation (in a separate [guide](./validate_correctness.ipynb))\n", "\n", "    a. On a single CPU device\n", "\n", "    - Verify that variables are created only once\n", "    - Check that variable counts, names, and shapes match\n", "    - Reset all variables and check numerical equivalence with all randomness disabled\n", "    - Align random number generation and check numerical equivalence in inference\n", "    - (Optional) Check that checkpoints are loaded properly and that the TF1.x/TF2 models generate identical output\n", "\n", "    b. On a single GPU/TPU device\n", "\n", "    c. With multi-device strategies\n", "\n", "3. Model training numerical equivalence validation for a few steps (code samples are available below)\n", "\n", "    a. Single training step validation using small and fixed data on a single CPU device. Specifically, check numerical equivalence for the following components:\n", "\n", "    - loss calculation\n", "    - metrics\n", "    - learning rate\n", "    - gradient calculation and update\n", "\n", "    b. Check statistics after training 3 or more steps to verify optimizer behaviors (such as momentum), still with fixed data on a single CPU device\n", "\n", "    c. On a single GPU/TPU device\n", "\n", "    d. With multi-device strategies (check the intro for [MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) at the bottom)\n", "\n", "4. End-to-end convergence testing on a real dataset\n", "\n", "    a. Check training behaviors with TensorBoard\n", "\n", "    - Use simple optimizers (e.g. SGD) and simple distribution strategies (e.g. `tf.distribute.OneDeviceStrategy`) first\n", "    - Training metrics\n", "    - Evaluation metrics\n", "    - Figure out what a reasonable tolerance for inherent randomness is\n", "\n", "    b. Check equivalence with advanced optimizers/learning rate schedulers/distribution strategies\n", "\n", "    c. Check equivalence when using mixed precision\n", "\n", "5. Additional product benchmarks" ] }, { "cell_type": "markdown", "metadata": { "id": "XKakQBI9-FLb" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:58:53.329328Z", "iopub.status.busy": "2023-11-07T19:58:53.328931Z", "iopub.status.idle": "2023-11-07T19:59:21.924707Z", "shell.execute_reply": "2023-11-07T19:59:21.923786Z" }, "id": "i1ghHyXl-Oqd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n", "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "# The `DeterministicRandomTestTool` is only available from TensorFlow 2.8:\n", "!pip install -q \"tensorflow==2.9.*\"" ] }, { "cell_type": "markdown", "metadata": { "id": "usyRSlIRl3r2" }, "source": [ "### Single forward pass validation\n", "\n", "Single forward pass validation, including checkpoint loading, is covered in a different [colab](./validate_correctness.ipynb)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:21.929277Z", "iopub.status.busy": "2023-11-07T19:59:21.928943Z", "iopub.status.idle": "2023-11-07T19:59:24.192347Z", "shell.execute_reply": "2023-11-07T19:59:24.191496Z" }, "id": "HVBQbsZeVL_V" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-11-07 19:59:22.275296: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n" ] } ], "source": [ "import sys\n", "import unittest\n", "import numpy as np\n", "\n", "import tensorflow as tf\n", "import tensorflow.compat.v1 as v1" ] }, { "cell_type": "markdown", "metadata": { "id": "4M104dt7m5cC" }, "source": [ "### Model training numerical equivalence validation for a few steps" ] }, { "cell_type": "markdown", "metadata": { "id": "v2Nz2Ni1EkMz" }, "source": [ "Set up the model configuration and prepare a fake dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.197049Z", "iopub.status.busy": "2023-11-07T19:59:24.196574Z", "iopub.status.idle": "2023-11-07T19:59:24.202189Z", "shell.execute_reply": "2023-11-07T19:59:24.201524Z" }, "id": "hUxXadzKU9rT" }, "outputs": [], "source": [ "params = {\n", "    'input_size': 3,\n", "    'num_classes': 3,\n", "    'layer_1_size': 2,\n", "    'layer_2_size': 2,\n", "    'num_train_steps': 100,\n", "    'init_lr': 1e-3,\n", "    'end_lr': 0.0,\n", "    'decay_steps': 1000,\n", "    'lr_power': 1.0,\n", "}\n", "\n", "# make a small fixed dataset\n", "fake_x = np.ones((2, params['input_size']), dtype=np.float32)\n", "fake_y = np.zeros((2, params['num_classes']), dtype=np.int32)\n", "fake_y[0][0] = 1\n", "fake_y[1][1] = 1\n", "\n", "step_num = 3" ] }, { "cell_type": "markdown", "metadata": { "id": "lV_n3Ukmz4Un" }, "source": [ "Define the TF1.x model." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.205710Z", "iopub.status.busy": "2023-11-07T19:59:24.205460Z", "iopub.status.idle": "2023-11-07T19:59:24.216387Z", "shell.execute_reply": "2023-11-07T19:59:24.215688Z" }, "id": "ATa5fzL8mAwl" }, "outputs": [], "source": [ "# Assume there is an existing TF1.x model using the estimator API\n", "# Wrap the model_fn to log necessary tensors for result comparison\n", "class SimpleModelWrapper():\n", "  def __init__(self):\n", "    self.logged_ops = {}\n", "    self.logs = {\n", "        'step': [],\n", "        'lr': [],\n", "        'loss': [],\n", "        'grads_and_vars': [],\n", "        'layer_out': []}\n", "\n", "  def model_fn(self, features, labels, mode, params):\n", "    out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", "    out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", "    logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n", "    loss = 
tf.compat.v1.losses.softmax_cross_entropy(labels, logits)\n", "\n", "    # skip EstimatorSpec details for prediction and evaluation\n", "    if mode == tf.estimator.ModeKeys.PREDICT:\n", "      pass\n", "    if mode == tf.estimator.ModeKeys.EVAL:\n", "      pass\n", "    assert mode == tf.estimator.ModeKeys.TRAIN\n", "\n", "    global_step = tf.compat.v1.train.get_or_create_global_step()\n", "    lr = tf.compat.v1.train.polynomial_decay(\n", "        learning_rate=params['init_lr'],\n", "        global_step=global_step,\n", "        decay_steps=params['decay_steps'],\n", "        end_learning_rate=params['end_lr'],\n", "        power=params['lr_power'])\n", "\n", "    optimizer = tf.compat.v1.train.GradientDescentOptimizer(lr)\n", "    grads_and_vars = optimizer.compute_gradients(\n", "        loss=loss,\n", "        # collect trainable variables from the graph currently being built\n", "        var_list=tf.compat.v1.get_default_graph().get_collection(\n", "            tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES))\n", "    train_op = optimizer.apply_gradients(\n", "        grads_and_vars,\n", "        global_step=global_step)\n", "\n", "    # log tensors\n", "    self.logged_ops['step'] = global_step\n", "    self.logged_ops['lr'] = lr\n", "    self.logged_ops['loss'] = loss\n", "    self.logged_ops['grads_and_vars'] = grads_and_vars\n", "    self.logged_ops['layer_out'] = {\n", "        'layer_1': out_1,\n", "        'layer_2': out_2,\n", "        'logits': logits}\n", "\n", "    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n", "\n", "  def update_logs(self, logs):\n", "    for key in logs.keys():\n", "      self.logs[key].append(logs[key])" ] }, { "cell_type": "markdown", "metadata": { "id": "kki9yILSKS7f" }, "source": [ "The [`v1.keras.utils.DeterministicRandomTestTool`](https://tensorflow.google.cn/api_docs/python/tf/compat/v1/keras/utils/DeterministicRandomTestTool) class below provides a context manager `scope()` that can make stateful random operations use the same seed across both TF1 graphs/sessions and eager execution.\n", "\n", "The tool provides two testing modes:\n", "\n", "1. `constant`, which uses the same seed for every single operation no matter how many times it has been called, and\n", "2. `num_random_ops`, which uses the number of previously-observed stateful random operations as the operation seed.\n", "\n", "This applies both to the stateful random operations used for creating and initializing variables, and to the stateful random operations used in computation (for example, for dropout layers)." ] }, 
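{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a small illustrative sketch (an addition to this guide, not part of the TF1.x/TF2 comparison itself) of how the two modes behave for plain `tf.random.uniform` calls in eager execution; `demo_tool` and the variable names are arbitrary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (added for illustration, not part of the original flow).\n", "# It assumes eager execution and uses `tf.random.uniform` as an example of a\n", "# stateful random op.\n", "\n", "# In `constant` mode every stateful random op reuses the same seed, so the two\n", "# draws below are expected to be identical.\n", "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='constant')\n", "with demo_tool.scope():\n", "  a = tf.random.uniform(shape=(3,))\n", "  b = tf.random.uniform(shape=(3,))\n", "np.testing.assert_allclose(a.numpy(), b.numpy())\n", "\n", "# In `num_random_ops` mode the seed follows the number of stateful random ops\n", "# observed so far, so `c` and `d` generally differ from each other, but the\n", "# whole sequence is reproducible as long as the program order is preserved.\n", "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "with demo_tool.scope():\n", "  c = tf.random.uniform(shape=(3,))\n", "  d = tf.random.uniform(shape=(3,))" ] }, 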
{ "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.219711Z", "iopub.status.busy": "2023-11-07T19:59:24.219460Z", "iopub.status.idle": "2023-11-07T19:59:24.532035Z", "shell.execute_reply": "2023-11-07T19:59:24.531336Z" }, "id": "X6Y3RWMoKOl8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_234052/2689227634.py:1: The name tf.keras.utils.DeterministicRandomTestTool is deprecated. Please use tf.compat.v1.keras.utils.DeterministicRandomTestTool instead.\n", "\n" ] } ], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')" ] }, { "cell_type": "markdown", "metadata": { "id": "mk5-ZzxcErX5" }, "source": [ "Run the TF1.x model in graph mode. Collect statistics for the first 3 training steps for numerical equivalence comparison." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:24.536063Z", "iopub.status.busy": "2023-11-07T19:59:24.535761Z", "iopub.status.idle": "2023-11-07T19:59:25.438355Z", "shell.execute_reply": "2023-11-07T19:59:25.437228Z" }, "id": "r5zhJHvsWA24" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-11-07 19:59:25.069458: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069585: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069663: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.069737: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.149993: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n", "2023-11-07 19:59:25.150247: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", "Skipping registering GPU devices...\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:14: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", "  out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:15: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", "  out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", "/tmpfs/tmp/ipykernel_234052/1984550333.py:16: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. 
Please use `tf.keras.layers.Dense` instead.\n", "  logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n" ] } ], "source": [ "with random_tool.scope():\n", "  graph = tf.Graph()\n", "  with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", "    model_tf1 = SimpleModelWrapper()\n", "    # build the model\n", "    inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", "    labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", "    spec = model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", "    train_op = spec.train_op\n", "\n", "    sess.run(tf.compat.v1.global_variables_initializer())\n", "    for step in range(step_num):\n", "      # log everything and update the model for one step\n", "      logs, _ = sess.run(\n", "          [model_tf1.logged_ops, train_op],\n", "          feed_dict={inputs: fake_x, labels: fake_y})\n", "      model_tf1.update_logs(logs)" ] }, { "cell_type": "markdown", "metadata": { "id": "eZxjI8Nxz9Ea" }, "source": [ "Define the TF2 model." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.442887Z", "iopub.status.busy": "2023-11-07T19:59:25.442589Z", "iopub.status.idle": "2023-11-07T19:59:25.453913Z", "shell.execute_reply": "2023-11-07T19:59:25.453186Z" }, "id": "AA67rh2TkS1M" }, "outputs": [], "source": [ "class SimpleModel(tf.keras.Model):\n", "  def __init__(self, params, *args, **kwargs):\n", "    super(SimpleModel, self).__init__(*args, **kwargs)\n", "    # define the model\n", "    self.dense_1 = tf.keras.layers.Dense(params['layer_1_size'])\n", "    self.dense_2 = tf.keras.layers.Dense(params['layer_2_size'])\n", "    self.out = tf.keras.layers.Dense(params['num_classes'])\n", "    learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(\n", "        initial_learning_rate=params['init_lr'],\n", "        decay_steps=params['decay_steps'],\n", "        end_learning_rate=params['end_lr'],\n", "        power=params['lr_power'])\n", "    self.optimizer = tf.keras.optimizers.legacy.SGD(learning_rate_fn)\n", "    self.compiled_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)\n", "    self.logs = {\n", "        'lr': [],\n", "        'loss': [],\n", "        'grads': [],\n", "        'weights': [],\n", "        'layer_out': []}\n", "\n", "  def call(self, inputs):\n", "    out_1 = self.dense_1(inputs)\n", "    out_2 = self.dense_2(out_1)\n", "    logits = self.out(out_2)\n", "    # log output features for every layer for comparison\n", "    layer_wise_out = {\n", "        'layer_1': out_1,\n", "        'layer_2': out_2,\n", "        'logits': logits}\n", "    self.logs['layer_out'].append(layer_wise_out)\n", "    return logits\n", "\n", "  def train_step(self, data):\n", "    x, y = data\n", "    with tf.GradientTape() as tape:\n", "      logits = self(x)\n", "      loss = self.compiled_loss(y, logits)\n", "    grads = tape.gradient(loss, self.trainable_weights)\n", "    # log training statistics\n", "    step = self.optimizer.iterations.numpy()\n", "    self.logs['lr'].append(self.optimizer.learning_rate(step).numpy())\n", "    self.logs['loss'].append(loss.numpy())\n", "    self.logs['grads'].append(grads)\n", "    self.logs['weights'].append(self.trainable_weights)\n", "    # update model\n", "    self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n", "    return" ] }, { "cell_type": "markdown", "metadata": { "id": "I5smAcaEE8nX" }, "source": [ "Run the TF2 model in eager mode. Collect statistics for the first 3 training steps for numerical equivalence comparison." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.457910Z", "iopub.status.busy": "2023-11-07T19:59:25.457419Z", "iopub.status.idle": "2023-11-07T19:59:25.516581Z", "shell.execute_reply": "2023-11-07T19:59:25.515677Z" }, "id": "Q0AbXF_eE8cS" }, "outputs": [], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "with random_tool.scope():\n", "  model_tf2 = SimpleModel(params)\n", "  for step in range(step_num):\n", "    model_tf2.train_step([fake_x, fake_y])" ] }, { "cell_type": "markdown", "metadata": { "id": "cjJDjLcAz_gU" }, "source": [ "Compare numerical equivalence for the first few training steps.\n", "\n", "You can also check the [validating correctness & numerical equivalence notebook](./validate_correctness.ipynb) for additional advice on numerical equivalence." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.521396Z", "iopub.status.busy": "2023-11-07T19:59:25.520967Z", "iopub.status.idle": "2023-11-07T19:59:25.561272Z", "shell.execute_reply": "2023-11-07T19:59:25.560102Z" }, "id": "6CbCUbsCiabC" }, "outputs": [], "source": [ "np.testing.assert_allclose(model_tf1.logs['lr'], model_tf2.logs['lr'])\n", "np.testing.assert_allclose(model_tf1.logs['loss'], model_tf2.logs['loss'])\n", "for step in range(step_num):\n", "  for name in model_tf1.logs['layer_out'][step]:\n", "    np.testing.assert_allclose(\n", "        model_tf1.logs['layer_out'][step][name],\n", "        model_tf2.logs['layer_out'][step][name])" ] }, { "cell_type": "markdown", "metadata": { "id": "dhVuuciimLIY" }, "source": [ "#### Unit tests" ] }, { "cell_type": "markdown", "metadata": { "id": "sXZYFC6Hhqeb" }, "source": [ "There are a few types of unit testing that can help debug your migration code.\n", "\n", "1. Single forward pass validation\n", "2. Model training numerical equivalence validation for a few steps\n", "3. Benchmark inference performance\n", "4. The trained model makes correct predictions on fixed and simple data points\n", "\n", "You can use `@parameterized.parameters` to test models with different configurations. [Details with a code sample](https://github.com/abseil/abseil-py/blob/master/absl/testing/parameterized.py).\n", "\n", "Note that it's possible to run session APIs and eager execution in the same test case. The code snippets below show how." ] }, 
{ "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.566564Z", "iopub.status.busy": "2023-11-07T19:59:25.565894Z", "iopub.status.idle": "2023-11-07T19:59:25.576572Z", "shell.execute_reply": "2023-11-07T19:59:25.575776Z" }, "id": "CdHqkgPPM2Bj" }, "outputs": [], "source": [ "import unittest\n", "\n", "class TestNumericalEquivalence(unittest.TestCase):\n", "\n", "  # copied from code samples above\n", "  # (`setUp` is run automatically by unittest before each test method)\n", "  def setUp(self):\n", "    # record statistics for 100 training steps\n", "    step_num = 100\n", "\n", "    # setup TF 1 model\n", "    random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "    with random_tool.scope():\n", "      # run TF1.x code in graph mode with context management\n", "      graph = tf.Graph()\n", "      with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", "        self.model_tf1 = SimpleModelWrapper()\n", "        # build the model\n", "        inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", "        labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", "        spec = self.model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", "        train_op = spec.train_op\n", "\n", "        sess.run(tf.compat.v1.global_variables_initializer())\n", "        for step in range(step_num):\n", "          # log everything and update the model for one step\n", "          logs, _ = sess.run(\n", "              [self.model_tf1.logged_ops, train_op],\n", "              feed_dict={inputs: fake_x, labels: fake_y})\n", "          self.model_tf1.update_logs(logs)\n", "\n", "    # setup TF2 model\n", "    random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "    with random_tool.scope():\n", "      self.model_tf2 = SimpleModel(params)\n", "      for step in range(step_num):\n", "        self.model_tf2.train_step([fake_x, fake_y])\n", "\n", "  def test_learning_rate(self):\n", "    np.testing.assert_allclose(\n", "        self.model_tf1.logs['lr'],\n", "        self.model_tf2.logs['lr'])\n", "\n", "  def test_training_loss(self):\n", "    # adopt different tolerance strategies before and after 10 steps\n", "    first_n_step = 10\n", "\n", "    # absolute difference is limited below 1e-5\n", "    # set `equal_nan` to False to detect potential NaN loss issues\n", "    absolute_tolerance = 1e-5\n", "    np.testing.assert_allclose(\n", "        actual=self.model_tf1.logs['loss'][:first_n_step],\n", "        desired=self.model_tf2.logs['loss'][:first_n_step],\n", "        atol=absolute_tolerance,\n", "        equal_nan=False)\n", "\n", "    # relative difference is limited below 5%\n", "    relative_tolerance = 0.05\n", "    np.testing.assert_allclose(self.model_tf1.logs['loss'][first_n_step:],\n", "                               self.model_tf2.logs['loss'][first_n_step:],\n", "                               rtol=relative_tolerance,\n", "                               equal_nan=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "gshSQdKIddpZ" }, "source": [ "## Debugging tools" ] }, { "cell_type": "markdown", "metadata": { "id": "CkMfCaJRclKv" }, "source": [ "### tf.print\n", "\n", "tf.print vs print/logging.info\n", "\n", "- With configurable arguments, `tf.print` can recursively display the first and last few elements of each dimension for printed tensors. Check the [API docs](https://tensorflow.google.cn/api_docs/python/tf/print) for details.\n", "- For eager execution, both `print` and `tf.print` print the value of the tensor. But `print` may involve a device-to-host copy, which can potentially slow down your code.\n", "- For graph mode, including usage inside `tf.function`, you need to use `tf.print` to print the actual tensor value. `tf.print` is compiled into an op in the graph, whereas `print` and `logging.info` only log at tracing time, which is often not what you want.\n", "- `tf.print` also supports printing composite tensors like `tf.RaggedTensor` and `tf.sparse.SparseTensor`.\n", "- You can also use callbacks to monitor metrics and variables. Check how to use custom callbacks with the [logs dict](https://tensorflow.google.cn/guide/keras/custom_callback#usage_of_logs_dict) and the [self.model attribute](https://tensorflow.google.cn/guide/keras/custom_callback#usage_of_selfmodel_attribute); a minimal callback sketch follows the `tf.print` example below." ] }, { "cell_type": "markdown", "metadata": { "id": "S-5h3cX8Dc50" }, "source": [ "tf.print vs print inside tf.function" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.580441Z", "iopub.status.busy": "2023-11-07T19:59:25.579896Z", "iopub.status.idle": "2023-11-07T19:59:25.650048Z", "shell.execute_reply": "2023-11-07T19:59:25.649208Z" }, "id": "gRED9FMyDKih" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "[2]\n" ] } ], "source": [ "# `print` prints info about the tensor object\n", "# `tf.print` prints the tensor value\n", "@tf.function\n", "def dummy_func(num):\n", "  num += 1\n", "  print(num)\n", "  tf.print(num)\n", "  return num\n", "\n", "_ = dummy_func(tf.constant([1.0]))\n", "\n", "# Output:\n", "# Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "# [2]" ] }, 
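{ "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned in the `tf.print` notes above, custom callbacks are another way to monitor metrics and variables during `Model.fit`. The next cell is an illustrative sketch (the `DebugCallback` name is hypothetical): it prints the logs dict and the norm of the first trainable weight at the end of each epoch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of monitoring training with a custom callback;\n", "# `DebugCallback` is a hypothetical name used only for illustration.\n", "class DebugCallback(tf.keras.callbacks.Callback):\n", "  def on_epoch_end(self, epoch, logs=None):\n", "    # `logs` carries the training/validation metrics for this epoch.\n", "    tf.print('epoch', epoch, 'logs:', logs)\n", "    # `self.model` gives access to the model being trained.\n", "    first_kernel = self.model.trainable_weights[0]\n", "    tf.print('first kernel norm:', tf.norm(first_kernel))\n", "\n", "# Example usage with any compiled Keras model:\n", "# model.fit(x, y, epochs=3, callbacks=[DebugCallback()])" ] }, 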
{ "cell_type": "markdown", "metadata": { "id": "3QroLA_zDK2w" }, "source": [ "tf.distribute.Strategy\n", "\n", "- If the `tf.function` containing `tf.print` is executed on the workers, for example when using `TPUStrategy` or `ParameterServerStrategy`, you need to check the worker/parameter server logs to find the printed values.\n", "- For `print` or `logging.info`, logs will be printed on the coordinator when using `ParameterServerStrategy`, and logs will be printed on the STDOUT of worker0 when using TPUs.\n", "\n", "tf.keras.Model\n", "\n", "- When using Sequential and Functional API models, if you want to print values, e.g., model inputs or intermediate features after some layers, you have the following options.\n", "  1. [Write a custom layer](https://tensorflow.google.cn/guide/keras/custom_layers_and_models) that `tf.print`s the inputs.\n", "  2. Include the intermediate outputs you want to inspect in the model outputs.\n", "- `tf.keras.layers.Lambda` layers have (de)serialization limitations. To avoid checkpoint loading issues, write a custom subclassed layer instead. Check the [API docs](https://tensorflow.google.cn/api_docs/python/tf/keras/layers/Lambda) for more details.\n", "- You can't `tf.print` intermediate outputs in a `tf.keras.callbacks.LambdaCallback` if you don't have access to the actual values, but only to the symbolic Keras tensor objects.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aKazGTr1ZUMG" }, "source": [ "Option 1: write a custom layer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:25.654207Z", "iopub.status.busy": "2023-11-07T19:59:25.653928Z", "iopub.status.idle": "2023-11-07T19:59:26.102988Z", "shell.execute_reply": "2023-11-07T19:59:26.102296Z" }, "id": "8w4aY7wO0B4W" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.327884018]\n", " [-0.109294683]\n", " [-0.218589365]]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/1 [==============================] - ETA: 0s - loss: 0.6077" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "1/1 [==============================] - 0s 319ms/step - loss: 0.6077\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class PrintLayer(tf.keras.layers.Layer):\n", "  def call(self, inputs):\n", "    tf.print(inputs)\n", "    return inputs\n", "\n", "def get_model():\n", "  inputs = tf.keras.layers.Input(shape=(1,))\n", "  out_1 = tf.keras.layers.Dense(4)(inputs)\n", "  out_2 = tf.keras.layers.Dense(1)(out_1)\n", "  # use the custom layer to tf.print intermediate features\n", "  out_3 = PrintLayer()(out_2)\n", "  model = tf.keras.Model(inputs=inputs, outputs=out_3)\n", "  return model\n", "\n", "model = get_model()\n", "model.compile(optimizer=\"adam\", loss=\"mse\")\n", "model.fit([1, 2, 3], [0.0, 0.0, 1.0])" ] }, { "cell_type": "markdown", "metadata": { "id": "KNESOatq7iM9" }, "source": [ "Option 2: include the intermediate outputs you want to inspect in the model outputs.\n", "\n", "Note that in such a case, you may need some [customizations](https://tensorflow.google.cn/guide/keras/customizing_what_happens_in_fit) to use `Model.fit`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-11-07T19:59:26.108865Z", "iopub.status.busy": "2023-11-07T19:59:26.108569Z", "iopub.status.idle": "2023-11-07T19:59:26.113371Z", "shell.execute_reply": "2023-11-07T19:59:26.112678Z" }, "id": "MiifvdLk7g9J" }, "outputs": [], "source": [ "def get_model():\n", "  inputs = tf.keras.layers.Input(shape=(1,))\n", "  out_1 = tf.keras.layers.Dense(4)(inputs)\n", "  out_2 = tf.keras.layers.Dense(1)(out_1)\n", "  # include intermediate values in the model outputs\n", "  model = tf.keras.Model(\n", "      inputs=inputs,\n", "      outputs={\n", "          'inputs': inputs,\n", "          'out_1': out_1,\n", "          'out_2': out_2})\n", "  return model" ] }, { "cell_type": "markdown", "metadata": { "id": "MvIKDZpHSLmQ" }, "source": [ "### pdb\n", "\n", "You can use [pdb](https://docs.python.org/3/library/pdb.html) both in the terminal and in Colab to inspect intermediate values for debugging.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Qu0n4O2umyT7" }, "source": [ "### Visualize the graph with TensorBoard\n", "\n", "You can [examine the TensorFlow graph with TensorBoard](https://tensorflow.google.cn/tensorboard/graphs). TensorBoard is also [supported on Colab](https://tensorflow.google.cn/tensorboard/tensorboard_in_notebooks). TensorBoard is a great tool for visualizing summaries. You can use it to compare the learning rate, model weights, gradient scale, training/validation metrics, or even intermediate model outputs between the TF1.x model and the migrated TF2 model through the training process, and check whether the values look as expected." ] }, { "cell_type": "markdown", "metadata": { "id": "vBnxB6_xzlnT" }, "source": [ 
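"As an illustrative sketch of getting the statistics collected earlier into TensorBoard, the next cell writes the per-step losses of both models with two summary writers; the `logs/debugging` directory layout is arbitrary. After loading the TensorBoard notebook extension, you could then compare the curves side by side with `%tensorboard --logdir logs/debugging`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch: write the statistics collected above as TensorBoard scalars.\n", "# The log directory names are arbitrary.\n", "tf1_writer = tf.summary.create_file_writer('logs/debugging/tf1')\n", "tf2_writer = tf.summary.create_file_writer('logs/debugging/tf2')\n", "\n", "with tf1_writer.as_default():\n", "  for step, loss in enumerate(model_tf1.logs['loss']):\n", "    tf.summary.scalar('loss', loss, step=step)\n", "\n", "with tf2_writer.as_default():\n", "  for step, loss in enumerate(model_tf2.logs['loss']):\n", "    tf.summary.scalar('loss', loss, step=step)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 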
"### TensorFlow Profiler\n", "\n", "[TensorFlow Profiler](https://tensorflow.google.cn/guide/profiler) 可以帮助您呈现 GPU/TPU 上的执行时间线。可以查看此 [Colab 演示](https://tensorflow.google.cn/tensorboard/tensorboard_profiling_keras)来了解其基本用法。" ] }, { "cell_type": "markdown", "metadata": { "id": "9wNmCSHBpiGM" }, "source": [ "### MultiProcessRunner\n", "\n", "在使用 MultiWorkerMirroredStrategy 和 ParameterServerStrategy 进行调试时,[MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) 是一个实用的工具。可以查看[此具体示例](https://github.com/keras-team/keras/blob/master/keras/integration_test/mwms_multi_process_runner_test.py)来了解它的用法。\n", "\n", "特别是对于这两种策略的情况,建议您 1) 不仅要使用单元测试来覆盖它们的流,2) 还要尝试在单元测试中使用它来重现失败,以避免每次尝试修复时都启动真正的分布式作业。" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "migration_debugging.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }