{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2D 卷积优化\n", "\n", "**原作者**: [Thierry Moreau](https://homes.cs.washington.edu/~moreau/)\n", "\n", "本教程提供了关于如何使用 TVM 映射二维卷积工作负载有效的 VTA 设计的概述。建议先学习 {ref}`vta-mat-mult-opt` 教程。\n", "\n", "二维卷积在大多数计算机视觉深度神经网络中占主导地位。在本教程中,将演示 TVM 调度优化,将 NCHW 布局中的 2D 卷积算子映射到 VTA。还引入了延迟隐藏(latency hiding)的概念,它允许最大化 VTA 的计算和内存资源利用。\n", "\n", "## RPC 设置\n", "\n", "首先编程 Pynq 的 FPGA 并构建它的 RPC 运行时。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import tvm\n", "import tvm.testing\n", "from tvm import te\n", "import vta\n", "import numpy as np\n", "\n", "from tvm import rpc\n", "from tvm.contrib import utils\n", "from vta.testing import simulator\n", "\n", "# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\n", "env = vta.get_env()\n", "\n", "# We read the Pynq RPC host IP address and port number from the OS environment\n", "host = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\n", "port = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n", "\n", "# We configure both the bitstream and the runtime system on the Pynq\n", "# to match the VTA configuration specified by the vta_config.json file.\n", "if env.TARGET == \"pynq\":\n", " # Make sure that TVM was compiled with RPC=1\n", " assert tvm.runtime.enabled(\"rpc\")\n", " remote = rpc.connect(host, port)\n", "\n", " # Reconfigure the JIT runtime\n", " vta.reconfig_runtime(remote)\n", "\n", " # Program the FPGA with a pre-compiled VTA bitstream.\n", " # You can program the FPGA with your own custom bitstream\n", " # by passing the path to the bitstream file instead of None.\n", " vta.program_fpga(remote, bitstream=None)\n", "\n", "# In simulation mode, host the RPC server locally.\n", "elif env.TARGET in [\"sim\", \"tsim\"]:\n", " remote = rpc.LocalSession()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 声明计算\n", "\n", "作为第一步,需要用 NCHW 格式描述 2D 卷积计算。\n", "\n", "通过 batch size、空间维度、输入通道、输出通道、核维度、核维度、填充维度和步长维度来定义二维卷积形状。\n", "\n", "选择 ResNet-18 架构的第 9 个卷积层的形状作为卷积 workload 参数。\n", "\n", "在 2D 卷积中添加了额外的算子,用于对输出进行移位和剪切,以模拟定点卷积之后的修正线性激活。将二维卷积层的 TVM 数据流图描述如下:\n", "\n", "```{image} images/conv2d_dataflow.png\n", ":align: center\n", "```\n", "\n", "这个计算被故意设置得太大,以至于不能一次全部放入 VTA 的 on-chip buffers。因此,在调度阶段,将依靠计算分块策略将计算分解为可管理的块。\n", "\n", "````{admonition} 空间填充\n", ":class: alert alert-info\n", "\n", "注意,需要导入 TOPI 库来对输入特征映射张量应用空间填充(Spatial padding)。空间填充有助于在 2D 卷积环境中分块,因为如果卷积核窗口大小大于 1,那么任何给定层的输入特征映射的相同 `(x, y)` 空间位置将被读取多次。在 CPU 和 GPU 上,当并行工作时,提高内存访问效率的一种方法是空间打包(spatial packing),这需要重新布局数据。VTA load DMA 引擎可以自动插入填充,这样原始的输入特征映射就不必在内存中重新打包。\n", "\n", "当数据从 DRAM load 到 VTA 的 SRAM 时,下面展示了 VTA 对动态空间填充的影响,随后是 2D 跨步和填充内存读取。\n", "\n", "```{image} images/padding.png\n", ":align: center\n", ":width: 480px\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from tvm import topi\n", "\n", "# 2D convolution layer dimensions taken from ResNet-18 architecture\n", "# (9th convolutional layer)\n", "batch_size = 1\n", "height = 14\n", "width = 14\n", "in_channels = 256\n", "out_channels = 256\n", "kernel_h = 3\n", "kernel_w = 3\n", "pad_h = 1\n", "pad_w = 1\n", "stride_h = 1\n", "stride_w = 1\n", "assert batch_size % env.BATCH == 0\n", "assert in_channels % env.BLOCK_IN == 0\n", "assert out_channels % env.BLOCK_OUT == 0\n", "\n", "# Input feature map: (N, IC, H, W, n, ic)\n", "data_shape = (\n", " batch_size // env.BATCH,\n", " 
in_channels // env.BLOCK_IN,\n", " height,\n", " width,\n", " env.BATCH,\n", " env.BLOCK_IN,\n", ")\n", "# Kernel: (OC, IC, H, W, oc, ic)\n", "kernel_shape = (\n", " out_channels // env.BLOCK_OUT,\n", " in_channels // env.BLOCK_IN,\n", " kernel_h,\n", " kernel_w,\n", " env.BLOCK_OUT,\n", " env.BLOCK_IN,\n", ")\n", "# Derive output feature map dimensions\n", "fout_height = (height + 2 * pad_h - kernel_h) // stride_h + 1\n", "fout_width = (width + 2 * pad_w - kernel_w) // stride_w + 1\n", "# Output feature map: (N, OC, H, W, n, oc)\n", "output_shape = (\n", " batch_size // env.BATCH,\n", " out_channels // env.BLOCK_OUT,\n", " fout_height,\n", " fout_width,\n", " env.BATCH,\n", " env.BLOCK_OUT,\n", ")\n", "\n", "# Convolution reduction axes\n", "dy = te.reduce_axis((0, kernel_h), name=\"dy\")\n", "dx = te.reduce_axis((0, kernel_w), name=\"dx\")\n", "ic = te.reduce_axis((0, in_channels // env.BLOCK_IN), name=\"ic\")\n", "ic_tns = te.reduce_axis((0, env.BLOCK_IN), name=\"ic_tns\")\n", "\n", "# Input placeholder tensors\n", "data = te.placeholder(data_shape, name=\"data\", dtype=env.inp_dtype)\n", "kernel = te.placeholder(kernel_shape, name=\"kernel\", dtype=env.wgt_dtype)\n", "\n", "# Copy buffers:\n", "# Apply spatial padding to input feature map\n", "data_buf = topi.nn.pad(data, [0, 0, pad_h, pad_w, 0, 0], name=\"data_buf\")\n", "kernel_buf = te.compute(kernel_shape, lambda *i: kernel(*i), \"kernel_buf\")\n", "\n", "# Declare 2D convolution\n", "res_conv = te.compute(\n", " output_shape,\n", " lambda bo, co, i, j, bi, ci: te.sum(\n", " data_buf[bo, ic, i * stride_h + dy, j * stride_w + dx, bi, ic_tns].astype(env.acc_dtype)\n", " * kernel_buf[co, ic, dy, dx, ci, ic_tns].astype(env.acc_dtype),\n", " axis=[ic, dy, dx, ic_tns],\n", " ),\n", " name=\"res_conv\",\n", ")\n", "\n", "# Add shift stage for fix-point normalization\n", "res_shr = te.compute(output_shape, lambda *i: res_conv(*i) >> 8, name=\"res_shr\")\n", "\n", "# Apply clipping between (0, input max value)\n", "inp_max = (1 << (env.INP_WIDTH - 1)) - 1\n", "res_max = te.compute(output_shape, lambda *i: tvm.te.max(res_shr(*i), 0), \"res_max\")\n", "res_min = te.compute(output_shape, lambda *i: tvm.te.min(res_max(*i), inp_max), \"res_min\")\n", "\n", "# Result Tensor\n", "res = te.compute(output_shape, lambda *i: res_min(*i).astype(env.inp_dtype), name=\"res\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 调度计算\n", "\n", "将看到一组必要的调度变换,以有效的方式将 2D 卷积映射到 VTA。这些包括:\n", "\n", "- 分块计算\n", "- 增加计算利用率(compute utilization)的虚拟线程(Virtual threading)\n", "- Lowering 到 VTA 硬件 intrinsics" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(data_1: handle, kernel_1: handle, res_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {data: Buffer(data_2: Pointer(int8), int8, [50176], []),\n", " kernel: Buffer(kernel_2: Pointer(int8), int8, [589824], []),\n", " res: Buffer(res_2: Pointer(int8), int8, [50176], [])}\n", " buffer_map = {data_1: data, kernel_1: kernel, res_1: res}\n", " preflattened_buffer_map = {data_1: data_3: Buffer(data_2, int8, [1, 16, 14, 14, 1, 16], []), kernel_1: kernel_3: Buffer(kernel_2, int8, [16, 16, 3, 3, 16, 16], []), res_1: res_3: Buffer(res_2, int8, [1, 16, 14, 14, 1, 16], [])} {\n", " allocate(data_buf: Pointer(global int8), int8, [65536]), storage_scope = global;\n", " allocate(kernel_buf: Pointer(global int8), 
int8, [589824]), storage_scope = global;\n", " allocate(res_conv: Pointer(global int32), int32, [50176]), storage_scope = global {\n", " for (i1: int32, 0, 16) {\n", " for (i2: int32, 0, 16) {\n", " for (i3: int32, 0, 16) {\n", " for (i5: int32, 0, 16) {\n", " let cse_var_1: int32 = (i3*16)\n", " data_buf_1: Buffer(data_buf, int8, [65536], [])[((((i1*4096) + (i2*256)) + cse_var_1) + i5)] = @tir.if_then_else(((((1 <= i2) && (i2 < 15)) && (1 <= i3)) && (i3 < 15)), data[(((((i1*3136) + (i2*224)) + cse_var_1) + i5) - 240)], 0i8, dtype=int8)\n", " }\n", " }\n", " }\n", " }\n", " for (i0: int32, 0, 16) {\n", " for (i1_1: int32, 0, 16) {\n", " for (i2_1: int32, 0, 3) {\n", " for (i3_1: int32, 0, 3) {\n", " for (i4: int32, 0, 16) {\n", " for (i5_1: int32, 0, 16) {\n", " let cse_var_2: int32 = ((((((i0*36864) + (i1_1*2304)) + (i2_1*768)) + (i3_1*256)) + (i4*16)) + i5_1)\n", " kernel_buf_1: Buffer(kernel_buf, int8, [589824], [])[cse_var_2] = kernel[cse_var_2]\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (co: int32, 0, 16) {\n", " for (i: int32, 0, 14) {\n", " for (j: int32, 0, 14) {\n", " for (ci: int32, 0, 16) {\n", " res_conv_1: Buffer(res_conv, int32, [50176], [])[((((co*3136) + (i*224)) + (j*16)) + ci)] = 0\n", " for (ic: int32, 0, 16) {\n", " for (dy: int32, 0, 3) {\n", " for (dx: int32, 0, 3) {\n", " for (ic_tns: int32, 0, 16) {\n", " let cse_var_4: int32 = (j*16)\n", " let cse_var_3: int32 = ((((co*3136) + (i*224)) + cse_var_4) + ci)\n", " res_conv_1[cse_var_3] = (res_conv_1[cse_var_3] + (cast(int32, data_buf_1[((((((ic*4096) + (i*256)) + (dy*256)) + cse_var_4) + (dx*16)) + ic_tns)])*cast(int32, kernel_buf_1[((((((co*36864) + (ic*2304)) + (dy*768)) + (dx*256)) + (ci*16)) + ic_tns)])))\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i1_2: int32, 0, 16) {\n", " for (i2_2: int32, 0, 14) {\n", " for (i3_2: int32, 0, 14) {\n", " for (i5_2: int32, 0, 16) {\n", " let cse_var_5: int32 = ((((i1_2*3136) + (i2_2*224)) + (i3_2*16)) + i5_2)\n", " res_conv_2: Buffer(res_conv, int32, [50176], [])[cse_var_5] = @tir.shift_right(res_conv_1[cse_var_5], 8, dtype=int32)\n", " }\n", " }\n", " }\n", " }\n", " for (i1_3: int32, 0, 16) {\n", " for (i2_3: int32, 0, 14) {\n", " for (i3_3: int32, 0, 14) {\n", " for (i5_3: int32, 0, 16) {\n", " let cse_var_6: int32 = ((((i1_3*3136) + (i2_3*224)) + (i3_3*16)) + i5_3)\n", " res_conv_3: Buffer(res_conv, int32, [50176], [])[cse_var_6] = max(res_conv_2[cse_var_6], 0)\n", " }\n", " }\n", " }\n", " }\n", " for (i1_4: int32, 0, 16) {\n", " for (i2_4: int32, 0, 14) {\n", " for (i3_4: int32, 0, 14) {\n", " for (i5_4: int32, 0, 16) {\n", " let cse_var_7: int32 = ((((i1_4*3136) + (i2_4*224)) + (i3_4*16)) + i5_4)\n", " res_conv_4: Buffer(res_conv, int32, [50176], [])[cse_var_7] = min(res_conv_3[cse_var_7], 127)\n", " }\n", " }\n", " }\n", " }\n", " for (i1_5: int32, 0, 16) {\n", " for (i2_5: int32, 0, 14) {\n", " for (i3_5: int32, 0, 14) {\n", " for (i5_5: int32, 0, 16) {\n", " let cse_var_8: int32 = ((((i1_5*3136) + (i2_5*224)) + (i3_5*16)) + i5_5)\n", " res[cse_var_8] = cast(int8, res_conv_4[cse_var_8])\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# Create TVM schedule\n", "s = te.create_schedule(res.op)\n", "# Let's look at the default TVM schedule\n", "print(tvm.lower(s, [data, kernel, res], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 分块计算\n", "\n", "默认情况下,2D 卷积太大,激活或卷积核权重无法同时适应 VTA 的 on-chip buffer。沿着输入通道、输出通道和高度空间维度应用分块。不沿宽度空间维度进行分块,因为它是 NCHW 
布局中的最内层维度(因此,为了增加局部性,最好不要沿最内层维度进行分块)。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's define tiling sizes\n", "b_block = 1 // env.BATCH\n", "oc_block = 128 // env.BLOCK_OUT\n", "ic_block = 16 // env.BLOCK_IN\n", "h_block = 7\n", "w_block = 14\n", "\n", "# Tile the output tensor along the spatial and output channel dimensions\n", "# (since by default we are doing single batch inference, the split along\n", "# the batch dimension has no effect)\n", "b, oc, y, x, b_tns, oc_tns = s[res].op.axis\n", "b_out, b_inn = s[res].split(b, factor=b_block)\n", "oc_out, oc_inn = s[res].split(oc, factor=oc_block)\n", "y_out, y_inn = s[res].split(y, factor=h_block)\n", "x_out, x_inn = s[res].split(x, factor=w_block)\n", "s[res].reorder(b_out, oc_out, y_out, x_out, b_inn, oc_inn, y_inn, x_inn, b_tns, oc_tns)\n", "\n", "# Move intermediate computation into each output compute tile\n", "s[res_conv].compute_at(s[res], x_out)\n", "s[res_shr].compute_at(s[res], x_out)\n", "s[res_max].compute_at(s[res], x_out)\n", "s[res_min].compute_at(s[res], x_out)\n", "\n", "# Apply additional loop split along reduction axis (input channel)\n", "b_inn, oc_inn, y_inn, x_inn, b_tns, oc_tns = s[res_conv].op.axis\n", "ic_out, ic_inn = s[res_conv].split(ic, factor=ic_block)\n", "\n", "# Reorder axes.\n", "# 1) Group the VTA tensor axes in the inner most position: b_tns, oc_tns, ic_tns\n", "# to allow TVM to tensorize.\n", "# 2) We move the ic_out axis all the way out of the convolution loop to block\n", "# along the reduction axis.\n", "# 3) Now we re-order the block axes: b_inn, oc_inn, y_inn, x_inn, ic_inn, dy, dx.\n", "# VTA runtime/hardware requires us to write to a different output feature map\n", "# location for every VTA tensor operation.\n", "# This restriction requires us to order one of oc_inn, y_inn or x_inn right\n", "# before b_tns, since they all affect output feature map indexing.\n", "# Therefore, we choose to bring x_inn inside as shown below.\n", "s[res_conv].reorder(ic_out, b_inn, oc_inn, y_inn, ic_inn, dy, dx, x_inn, b_tns, oc_tns, ic_tns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 虚拟线程\n", "\n", "虚拟线程是一种在 VTA 硬件设计中增加任务级管道并行性的机制。换句话说,它通过隐藏内存访问延迟(hiding memory access latency)提高了计算资源的利用率。\n", "\n", "在下面的实现中,虚拟线程将工作分配给沿输出通道轴划分的两个线程。在下面的图中,展示了计算 2D 卷积时工作是如何分割的。\n", "\n", "```{image} images/virtual_threading.png\n", ":align: center\n", ":width: 480px\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(data_1: handle, kernel_1: handle, res_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {data: Buffer(data_2: Pointer(int8), int8, [50176], []),\n", " kernel: Buffer(kernel_2: Pointer(int8), int8, [589824], []),\n", " res: Buffer(res_2: Pointer(int8), int8, [50176], [])}\n", " buffer_map = {data_1: data, kernel_1: kernel, res_1: res}\n", " preflattened_buffer_map = {data_1: data_3: Buffer(data_2, int8, [1, 16, 14, 14, 1, 16], []), kernel_1: kernel_3: Buffer(kernel_2, int8, [16, 16, 3, 3, 16, 16], []), res_1: res_3: Buffer(res_2, int8, [1, 16, 14, 14, 1, 16], [])} {\n", " allocate(data_buf: Pointer(global int8), int8, [65536]), storage_scope = global;\n", " allocate(kernel_buf: Pointer(global int8), int8, [589824]), storage_scope = global;\n", " allocate(res_conv: Pointer(global int32), int32, [25088]), storage_scope 
= global {\n", " for (i1: int32, 0, 16) {\n", " for (i2: int32, 0, 16) {\n", " for (i3: int32, 0, 16) {\n", " for (i5: int32, 0, 16) {\n", " let cse_var_1: int32 = (i3*16)\n", " data_buf_1: Buffer(data_buf, int8, [65536], [])[((((i1*4096) + (i2*256)) + cse_var_1) + i5)] = @tir.if_then_else(((((1 <= i2) && (i2 < 15)) && (1 <= i3)) && (i3 < 15)), data[(((((i1*3136) + (i2*224)) + cse_var_1) + i5) - 240)], 0i8, dtype=int8)\n", " }\n", " }\n", " }\n", " }\n", " for (i0: int32, 0, 16) {\n", " for (i1_1: int32, 0, 16) {\n", " for (i2_1: int32, 0, 3) {\n", " for (i3_1: int32, 0, 3) {\n", " for (i4: int32, 0, 16) {\n", " for (i5_1: int32, 0, 16) {\n", " let cse_var_2: int32 = ((((((i0*36864) + (i1_1*2304)) + (i2_1*768)) + (i3_1*256)) + (i4*16)) + i5_1)\n", " kernel_buf_1: Buffer(kernel_buf, int8, [589824], [])[cse_var_2] = kernel[cse_var_2]\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i2.outer: int32, 0, 2) {\n", " for (co.init: int32, 0, 8) {\n", " for (i.init: int32, 0, 7) {\n", " for (j.init: int32, 0, 14) {\n", " for (ci.init: int32, 0, 16) {\n", " let cse_var_3: int32 = ((((co.init*1568) + (i.init*224)) + (j.init*16)) + ci.init)\n", " {\n", " res_conv_1: Buffer(res_conv, int32, [157351936], [])[cse_var_3] = 0\n", " res_conv_1[(cse_var_3 + 12544)] = 0\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (ic.outer: int32, 0, 16) {\n", " for (co: int32, 0, 8) {\n", " for (i: int32, 0, 7) {\n", " for (dy: int32, 0, 3) {\n", " for (dx: int32, 0, 3) {\n", " for (j: int32, 0, 14) {\n", " for (ci: int32, 0, 16) {\n", " for (ic_tns: int32, 0, 16) {\n", " let cse_var_8: int32 = (j*16)\n", " let cse_var_7: int32 = ((((co*1568) + (i*224)) + cse_var_8) + ci)\n", " let cse_var_6: int32 = (cse_var_7 + 12544)\n", " let cse_var_5: int32 = ((((((co*36864) + (ic.outer*2304)) + (dy*768)) + (dx*256)) + (ci*16)) + ic_tns)\n", " let cse_var_4: int32 = (((((((ic.outer*4096) + (i2.outer*1792)) + (i*256)) + (dy*256)) + cse_var_8) + (dx*16)) + ic_tns)\n", " {\n", " res_conv_1[cse_var_7] = (res_conv_1[cse_var_7] + (cast(int32, data_buf_1[cse_var_4])*cast(int32, kernel_buf_1[cse_var_5])))\n", " res_conv_1[cse_var_6] = (res_conv_1[cse_var_6] + (cast(int32, data_buf_1[cse_var_4])*cast(int32, kernel_buf_1[(cse_var_5 + 294912)])))\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i1_2: int32, 0, 8) {\n", " for (i2_2: int32, 0, 7) {\n", " for (i3_2: int32, 0, 14) {\n", " for (i5_2: int32, 0, 16) {\n", " let cse_var_10: int32 = ((((i1_2*1568) + (i2_2*224)) + (i3_2*16)) + i5_2)\n", " let cse_var_9: int32 = (cse_var_10 + 12544)\n", " {\n", " res_conv_2: Buffer(res_conv, int32, [157351936], [])[cse_var_10] = @tir.shift_right(res_conv_1[cse_var_10], 8, dtype=int32)\n", " res_conv_2[cse_var_9] = @tir.shift_right(res_conv_1[cse_var_9], 8, dtype=int32)\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i1_3: int32, 0, 8) {\n", " for (i2_3: int32, 0, 7) {\n", " for (i3_3: int32, 0, 14) {\n", " for (i5_3: int32, 0, 16) {\n", " let cse_var_12: int32 = ((((i1_3*1568) + (i2_3*224)) + (i3_3*16)) + i5_3)\n", " let cse_var_11: int32 = (cse_var_12 + 12544)\n", " {\n", " res_conv_3: Buffer(res_conv, int32, [157351936], [])[cse_var_12] = max(res_conv_2[cse_var_12], 0)\n", " res_conv_3[cse_var_11] = max(res_conv_2[cse_var_11], 0)\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i1_4: int32, 0, 8) {\n", " for (i2_4: int32, 0, 7) {\n", " for (i3_4: int32, 0, 14) {\n", " for (i5_4: int32, 0, 16) {\n", " let cse_var_14: int32 = ((((i1_4*1568) + (i2_4*224)) + (i3_4*16)) + i5_4)\n", " let cse_var_13: int32 
= (cse_var_14 + 12544)\n", " {\n", " res_conv_4: Buffer(res_conv, int32, [157351936], [])[cse_var_14] = min(res_conv_3[cse_var_14], 127)\n", " res_conv_4[cse_var_13] = min(res_conv_3[cse_var_13], 127)\n", " }\n", " }\n", " }\n", " }\n", " }\n", " for (i1.inner: int32, 0, 8) {\n", " for (i2.inner: int32, 0, 7) {\n", " for (i3.inner: int32, 0, 14) {\n", " for (i5_5: int32, 0, 16) {\n", " let cse_var_18: int32 = (i2.inner*224)\n", " let cse_var_17: int32 = (i3.inner*16)\n", " let cse_var_16: int32 = ((((i1.inner*1568) + cse_var_18) + cse_var_17) + i5_5)\n", " let cse_var_15: int32 = (((((i1.inner*3136) + (i2.outer*1568)) + cse_var_18) + cse_var_17) + i5_5)\n", " {\n", " res[cse_var_15] = cast(int8, res_conv_4[cse_var_16])\n", " res[(cse_var_15 + 25088)] = cast(int8, res_conv_4[(cse_var_16 + 12544)])\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# VTA only supports 2 virtual threads\n", "v_threads = 2\n", "\n", "# Perform virtual thread split along output channel outer axis\n", "_, tx = s[res].split(oc_out, factor=v_threads)\n", "s[res].reorder(tx, b_out)\n", "s[res].bind(tx, te.thread_axis(\"cthread\"))\n", "\n", "# Let's look at the current TVM schedule after blocking and virtual threading\n", "print(tvm.lower(s, [data, kernel, res], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lowering Copies 到 DMA Transfers\n", "\n", "接下来,设置相应的 on-chip VTA SRAM buffers 的 buffers 作用域。将 load 循环移动到 2D 卷积计算循环,以 stage 内存加载,以便它们适合 on-chip SRAM buffers。最后,用 DMA 复制 pragma 注解了 load/store 循环外轴,以便在 VTA 上执行大容量内存传输。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set scope of SRAM buffers\n", "s[data_buf].set_scope(env.inp_scope)\n", "s[kernel_buf].set_scope(env.wgt_scope)\n", "s[res_conv].set_scope(env.acc_scope)\n", "s[res_shr].set_scope(env.acc_scope)\n", "s[res_min].set_scope(env.acc_scope)\n", "s[res_max].set_scope(env.acc_scope)\n", "\n", "# Block data and kernel cache reads\n", "s[data_buf].compute_at(s[res_conv], ic_out)\n", "s[kernel_buf].compute_at(s[res_conv], ic_out)\n", "\n", "# Use DMA copy pragma on DRAM->SRAM operations\n", "s[data_buf].pragma(s[data_buf].op.axis[0], env.dma_copy)\n", "s[kernel_buf].pragma(s[kernel_buf].op.axis[0], env.dma_copy)\n", "\n", "# Use DMA copy pragma on SRAM->DRAM operation in each result block\n", "# (this implies that these copies should be performed along b_inn,\n", "# or result axis 4)\n", "s[res].pragma(s[res].op.axis[4], env.dma_copy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lowering 计算到 VTA 计算 Intrinsics\n", "\n", "最后阶段是通过将二维卷积映射为张量 intrinsics,并将位移和剪切计算映射为向量 ALU,从而将计算循环 lower 到 VTA 硬件 intrinsics。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(data_1: handle, kernel_1: handle, res_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {data: Buffer(data_2: Pointer(int8), int8, [50176], []),\n", " kernel: Buffer(kernel_2: Pointer(int8), int8, [589824], []),\n", " res: Buffer(res_2: Pointer(int8), int8, [50176], [])}\n", " buffer_map = {data_1: data, kernel_1: kernel, res_1: res}\n", " preflattened_buffer_map = {data_1: data_3: Buffer(data_2, int8, [1, 16, 14, 14, 1, 16], []), kernel_1: kernel_3: Buffer(kernel_2, int8, [16, 16, 3, 3, 16, 16], []), res_1: res_3: Buffer(res_2, int8, [1, 16, 
14, 14, 1, 16], [])} {\n", " @tir.vta.coproc_dep_push(3, 2, dtype=int32)\n", " @tir.vta.coproc_dep_push(3, 2, dtype=int32)\n", " for (i2.outer: int32, 0, 2) {\n", " for (cthread.s: int32, 0, 2) {\n", " attr [IterVar(vta: int32, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 2 {\n", " @tir.vta.coproc_dep_pop(3, 2, dtype=int32)\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_uop_scope\" = \"VTAPushGEMMOp\" {\n", " @tir.call_extern(\"VTAUopLoopBegin\", 8, 98, 0, 0, dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopBegin\", 7, 14, 0, 0, dtype=int32)\n", " for (j.init: int32, 0, 14) {\n", " @tir.vta.uop_push(0, 1, ((cthread.s*784) + j.init), 0, 0, 0, 0, 0, dtype=int32)\n", " }\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " }\n", " @tir.vta.coproc_dep_push(2, 1, dtype=int32)\n", " }\n", " }\n", " for (ic.outer: int32, 0, 16) {\n", " let cse_var_6: int32 = (i2.outer*7)\n", " let cse_var_5: int32 = (ic.outer*9)\n", " let cse_var_4: int32 = max((1 - cse_var_6), 0)\n", " let cse_var_3: int32 = max((cse_var_6 - 6), 0)\n", " let cse_var_2: int32 = ((9 - cse_var_4) - cse_var_3)\n", " let cse_var_1: int32 = ((((ic.outer*196) + (i2.outer*98)) + (cse_var_4*14)) - 14)\n", " {\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 1 {\n", " @tir.vta.coproc_dep_pop(2, 1, dtype=int32)\n", " @tir.call_extern(\"VTALoadBuffer2D\", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), data_2, cse_var_1, 14, cse_var_2, 14, 1, cse_var_4, 1, cse_var_3, 0, 2, dtype=int32)\n", " @tir.call_extern(\"VTALoadBuffer2D\", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), kernel_2, cse_var_5, 9, 8, 144, 0, 0, 0, 0, 0, 1, dtype=int32)\n", " @tir.vta.coproc_dep_push(1, 2, dtype=int32)\n", " }\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 1 {\n", " @tir.vta.coproc_dep_pop(2, 1, dtype=int32)\n", " @tir.call_extern(\"VTALoadBuffer2D\", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), data_2, cse_var_1, 14, cse_var_2, 14, 1, cse_var_4, 1, cse_var_3, 144, 2, dtype=int32)\n", " @tir.call_extern(\"VTALoadBuffer2D\", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), kernel_2, (cse_var_5 + 1152), 9, 8, 144, 0, 0, 0, 0, 72, 1, dtype=int32)\n", " @tir.vta.coproc_dep_push(1, 2, dtype=int32)\n", " }\n", " for (cthread.s_1: int32, 0, 2) {\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 2 {\n", " @tir.vta.coproc_dep_pop(1, 2, dtype=int32)\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_uop_scope\" = \"VTAPushGEMMOp\" {\n", " @tir.call_extern(\"VTAUopLoopBegin\", 8, 98, 0, 9, dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopBegin\", 7, 14, 16, 0, dtype=int32)\n", " for (dy: int32, 0, 3) {\n", " for (dx: int32, 0, 3) {\n", " for (j: int32, 0, 14) {\n", " @tir.vta.uop_push(0, 0, ((cthread.s_1*784) + j), ((((cthread.s_1*144) + (dy*16)) + j) + dx), (((cthread.s_1*72) + (dy*3)) + dx), 0, 0, 0, dtype=int32)\n", " }\n", " }\n", " }\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " }\n", " @tir.vta.coproc_dep_push(2, 1, dtype=int32)\n", " }\n", " }\n", " }\n", " }\n", " @tir.vta.coproc_dep_pop(2, 1, dtype=int32)\n", " @tir.vta.coproc_dep_pop(2, 1, dtype=int32)\n", " for (cthread.s_2: int32, 0, 2) {\n", " let cse_var_7: int32 = (cthread.s_2*784)\n", " 
attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 2 {\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_uop_scope\" = \"VTAPushALUOp\" {\n", " @tir.call_extern(\"VTAUopLoopBegin\", 784, 1, 1, 0, dtype=int32)\n", " @tir.vta.uop_push(1, 0, cse_var_7, cse_var_7, 0, 3, 1, 8, dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " }\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_uop_scope\" = \"VTAPushALUOp\" {\n", " @tir.call_extern(\"VTAUopLoopBegin\", 784, 1, 1, 0, dtype=int32)\n", " @tir.vta.uop_push(1, 0, cse_var_7, cse_var_7, 0, 1, 1, 0, dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " }\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_uop_scope\" = \"VTAPushALUOp\" {\n", " @tir.call_extern(\"VTAUopLoopBegin\", 784, 1, 1, 0, dtype=int32)\n", " @tir.vta.uop_push(1, 0, cse_var_7, cse_var_7, 0, 0, 1, 127, dtype=int32)\n", " @tir.call_extern(\"VTAUopLoopEnd\", dtype=int32)\n", " }\n", " @tir.vta.coproc_dep_push(2, 3, dtype=int32)\n", " }\n", " }\n", " for (cthread.s_3: int32, 0, 2) {\n", " attr [IterVar(vta, (nullptr), \"ThreadIndex\", \"vta\")] \"coproc_scope\" = 3 {\n", " @tir.vta.coproc_dep_pop(2, 3, dtype=int32)\n", " for (i1.inner: int32, 0, 8) {\n", " for (i2.inner: int32, 0, 7) {\n", " for (i3.inner: int32, 0, 14) {\n", " let cse_var_8: int32 = (i2.inner*14)\n", " @tir.call_extern(\"VTAStoreBuffer2D\", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), ((((cthread.s_3*784) + (i1.inner*98)) + cse_var_8) + i3.inner), 4, res_2, (((((cthread.s_3*1568) + (i1.inner*196)) + (i2.outer*98)) + cse_var_8) + i3.inner), 1, 1, 1, dtype=int32)\n", " }\n", " }\n", " }\n", " @tir.vta.coproc_dep_push(3, 2, dtype=int32)\n", " }\n", " }\n", " }\n", " @tir.vta.coproc_dep_pop(3, 2, dtype=int32)\n", " @tir.vta.coproc_dep_pop(3, 2, dtype=int32)\n", " @tir.vta.coproc_sync(, dtype=int32)\n", "}\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[16:02:31] /media/pc/data/4tb/lxw/books/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement required_alignment=256, provided_alignment=128\n" ] } ], "source": [ "# Apply tensorization over the batch tensor tile axis\n", "s[res_conv].tensorize(b_tns, env.gemm)\n", "\n", "# Add an ALU pragma over the shift and clipping operations\n", "s[res_shr].pragma(s[res_shr].op.axis[0], env.alu)\n", "s[res_min].pragma(s[res_min].op.axis[0], env.alu)\n", "s[res_max].pragma(s[res_max].op.axis[0], env.alu)\n", "\n", "# Let's look at the final lowered TVM schedule after lowering memory\n", "# loads/stores down to DMA copy intrinsics, and the computation down to\n", "# VTA compute intrinsics.\n", "print(vta.lower(s, [data, kernel, res], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TVM 计算和验证\n", "\n", "在指定调度之后,可以将其编译为 TVM 函数。保存模块,这样就可以通过 RPC 发送它。运行该函数并对 numpy 实现进行验证,以确保其正确性。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[16:02:32] /media/pc/data/4tb/lxw/books/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement required_alignment=256, provided_alignment=128\n", "/media/workspace/anaconda3/envs/mx/lib/python3.10/site-packages/tvm/driver/build_module.py:263: UserWarning: target_host parameter is going to be deprecated. 
Please pass in tvm.target.Target(target, host=target_host) instead.\n", "  warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Execution statistics:\n", "\tinp_load_nbytes :           114688\n", "\twgt_load_nbytes :          1179648\n", "\tacc_load_nbytes :                0\n", "\tuop_load_nbytes :             1144\n", "\tout_store_nbytes:            50176\n", "\tgemm_counter    :           451584\n", "\talu_counter     :             9408\n", "Successful 2D convolution test!\n" ] } ], "source": [ "# This library facilitates 2D convolution testing\n", "from tvm.topi.testing import conv2d_nchw_python\n", "\n", "# Compile the TVM module\n", "with vta.build_config(disabled_pass={\"tir.CommonSubexprElimTIR\"}):\n", "    my_conv = vta.build(\n", "        s, [data, kernel, res], tvm.target.Target(\"ext_dev\", host=env.target_host), name=\"my_conv\"\n", "    )\n", "temp = utils.tempdir()\n", "my_conv.save(temp.relpath(\"conv2d.o\"))\n", "remote.upload(temp.relpath(\"conv2d.o\"))\n", "f = remote.load_module(\"conv2d.o\")\n", "\n", "# Get the remote device context\n", "ctx = remote.ext_dev(0)\n", "\n", "# Initialize the data and kernel arrays randomly in the int range\n", "# of [-128, 128) in NCHW layout\n", "data_np = np.random.randint(-128, 128, size=(batch_size, in_channels, height, width)).astype(\n", "    data.dtype\n", ")\n", "kernel_np = np.random.randint(\n", "    -128, 128, size=(out_channels, in_channels, kernel_h, kernel_w)\n", ").astype(kernel.dtype)\n", "\n", "# Apply packing to the data and kernel arrays from a 4D NCHW\n", "# to a 6D NCHWnc packed layout\n", "data_packed = data_np.reshape(\n", "    batch_size // env.BATCH, env.BATCH, in_channels // env.BLOCK_IN, env.BLOCK_IN, height, width\n", ").transpose((0, 2, 4, 5, 1, 3))\n", "\n", "kernel_packed = kernel_np.reshape(\n", "    out_channels // env.BLOCK_OUT,\n", "    env.BLOCK_OUT,\n", "    in_channels // env.BLOCK_IN,\n", "    env.BLOCK_IN,\n", "    kernel_h,\n", "    kernel_w,\n", ").transpose((0, 2, 4, 5, 1, 3))\n", "\n", "# Format the input/output arrays with tvm.nd.array to the DLPack standard\n", "data_nd = tvm.nd.array(data_packed, ctx)\n", "kernel_nd = tvm.nd.array(kernel_packed, ctx)\n", "res_nd = tvm.nd.array(np.zeros(output_shape).astype(res.dtype), ctx)\n", "\n", "# Clear stats\n", "if env.TARGET in [\"sim\", \"tsim\"]:\n", "    simulator.clear_stats()\n", "\n", "# Invoke the module to perform the computation\n", "f(data_nd, kernel_nd, res_nd)\n", "\n", "# Verify against numpy implementation\n", "res_ref = conv2d_nchw_python(\n", "    data_np.astype(env.acc_dtype),\n", "    kernel_np.astype(env.acc_dtype),\n", "    (stride_h, stride_w),\n", "    (pad_h, pad_w),\n", ").astype(env.acc_dtype)\n", "res_ref = res_ref >> env.INP_WIDTH\n", "res_ref = np.clip(res_ref, 0, inp_max)\n", "res_ref = res_ref.astype(res.dtype)\n", "res_ref = res_ref.reshape(\n", "    (\n", "        batch_size // env.BATCH,\n", "        env.BATCH,\n", "        out_channels // env.BLOCK_OUT,\n", "        env.BLOCK_OUT,\n", "        fout_height,\n", "        fout_width,\n", "    )\n", ").transpose((0, 2, 4, 5, 1, 3))\n", "tvm.testing.assert_allclose(res_ref, res_nd.numpy())\n", "\n", "# Print stats\n", "if env.TARGET in [\"sim\", \"tsim\"]:\n", "    sim_stats = simulator.stats()\n", "    print(\"Execution statistics:\")\n", "    for k, v in sim_stats.items():\n", "        print(\"\\t{:<16}: {:>16}\".format(k, v))\n", "\n", "print(\"Successful 2D convolution test!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 小结\n", "\n", "本教程演示了如何使用 TVM 调度原语将 2D 卷积 lower 到硬件加速器 intrinsics,并利用特定于硬件的优化,例如借助虚拟线程实现延迟隐藏。" ] } ], "metadata": { "interpreter": { "hash": "ee40e4cbda3c4716866f133b45765e0887afdbc9aa3bd872ab229f889d521355" 
}, "kernelspec": { "display_name": "Python 3.10.4 ('mx')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 0 }