{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(tutorial-tensor-expr-get-started)=\n", "# 使用张量表达式处理算子\n", "\n", "**Author**: [Tianqi Chen](https://tqchen.github.io)\n", "\n", "在本教程中,把注意力转向 TVM 如何使用张量表达式(Tensor Expression,简称 TE)定义张量计算并应用循环优化。TE 以纯函数式语言描述张量计算(即每个表达式都没有副作用)。从 TVM 的整体来看,Relay 将计算描述为一组算子,这些算子都可以表示为 TE 表达式,每个 TE 表达式都接受输入张量并生成输出张量。\n", "\n", "这是关于 TVM 中张量表达式语言的介绍性教程。TVM 使用领域专用张量表达式来进行有效的内核构建。通过两个使用张量表达式语言的例子,演示基本工作流程。第一个例子介绍了 TE 和用向量加法进行调度。第二个例子扩展了这些概念,用 TE 逐步优化矩阵乘法。这个矩阵乘法的例子将作为未来涵盖 TVM 更高级功能的教程的基础。\n", "\n", "## 例 1:为 CPU 编写和调度 TE 中的向量加法\n", "\n", "让我们看看 Python 中的例子,将实现向量加法的 TE,然后是针对 CPU 的调度。首先初始化 TVM 环境。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import warnings\n", "import numpy as np\n", "# import env\n", "\n", "import tvm\n", "from tvm import testing\n", "from tvm import te\n", "from tvm.target import Target\n", "\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果你能确定你所针对的 CPU 并指定它,你将获得更好的性能。如果你使用 LLVM,你可以从命令 ``llc --version`` 中得到这个信息,以获得 CPU 类型,你可以检查 ``/proc/cpuinfo``,了解你的处理器可能支持的额外扩展。例如,你可以使用 ``llvm -mcpu=skylake-avx512`` 来获取带有 AVX-512 指令的 CPU。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tgt = Target(target=\"llvm\", host=\"llvm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 描述张量计算\n", "\n", "描述矢量加法的计算。TVM 采用了张量语义,每个中间结果都表示为一个多维数组。用户需要描述生成张量的计算规则。首先定义符号变量 `n` 来表示形状。然后定义两个占位符张量 `A` 和 `B`,具有给定的形状 `(n,)`。然后用 ``compute`` 算子来描述结果张量 `C`。``compute`` 定义了计算,其输出符合指定的张量形状,计算将在张量中的每个位置进行,由 `lambda` 函数定义。注意,虽然 `n` 是变量,但它定义了 `A`、`B` 和 `C` 张量之间的一致形状。记住,在这个阶段没有实际的计算发生,因为只是声明了计算应该如何进行。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n = te.var(\"n\")\n", "A = te.placeholder((n,), name=\"A\")\n", "B = te.placeholder((n,), name=\"B\")\n", "C = te.compute(A.shape, lambda i: A[i] + B[i], name=\"C\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Lambda 函数\n", "`te.compute` 方法的第二个参数是执行计算的函数。在这个例子中,使用一个匿名函数,也被称为 `lambda` 函数,来定义计算,在本例中是对 `A` 和 `B` 的第 `i` 个元素进行加法。\n", "```\n", "\n", "### 为计算创建默认的调度\n", "\n", "虽然上面几行描述了计算规则,但可以用许多不同的方式计算 `C`,以适应不同的设备。对于有多个轴的张量,你可以选择先迭代哪个轴,或者计算可以分成不同的线程。TVM 要求用户提供调度,这是关于计算应该如何进行的描述。TE 中的调度操作可以改变循环顺序,在不同的线程中分割计算,并将数据块分组,以及其他操作。调度背后的重要概念是,它们只描述计算是如何进行的,所以同一个 TE 的不同调度会产生相同的结果。\n", "\n", "TVM 允许你创建一个自然的调度,通过以行为单位迭代的方式进行 `C` 运算。\n", "```c\n", "for (int i = 0; i < n; ++i) {\n", " C[i] = A[i] + B[i];\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "s = te.create_schedule(C.op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 为计算创建默认调度\n", "\n", "有了 TE 表达式和调度,就可以为目标语言和架构(在这里是指 LLVM 和 CPU)生成可运行的代码。向 TVM 提供调度、调度中的 TE 表达式的列表、目标和主机,以及我们要产生的函数的名称。输出的结果是类型消除的(type-erased)函数,可以直接从 Python 中调用。\n", "\n", "在下面一行,使用 `tvm.build` 来创建函数。`build` 函数需要调度、所需的函数签名(包括输入和输出)以及要编译的目标语言。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fadd = tvm.build(s, [A, B, C], tgt, name=\"myadd\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "运行这个函数,并将其输出与 `numpy` 中的相同计算进行比较。编译后的 TVM 函数暴露了简洁的 C 语言 API,可以从任何语言调用。首先创建设备,也就是 TVM 可以编译调度的设备(本例中为 CPU)。在本例中,该设备是 LLVM CPU 目标。然后可以初始化设备中的张量,并执行自定义的加法运算。为了验证计算是否正确,我们可以将 `c` 张量的输出结果与 `numpy` 进行的相同计算进行比较。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { 
"collapsed": false }, "outputs": [], "source": [ "dev = tvm.device(tgt.kind.name, 0)\n", "\n", "n = 1024\n", "a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)\n", "b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)\n", "c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)\n", "fadd(a, b, c)\n", "testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "为了得到这个版本与 numpy 相比有多快的比较,创建辅助函数来运行 TVM 生成代码的配置文件。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy running time: 0.000016\n", "naive: 0.000013\n" ] } ], "source": [ "import timeit\n", "\n", "np_repeat = 100\n", "np_running_time = timeit.timeit(\n", " setup=\"import numpy\\n\"\n", " \"n = 32768\\n\"\n", " 'dtype = \"float32\"\\n'\n", " \"a = numpy.random.rand(n, 1).astype(dtype)\\n\"\n", " \"b = numpy.random.rand(n, 1).astype(dtype)\\n\",\n", " stmt=\"answer = a + b\",\n", " number=np_repeat,\n", ")\n", "print(\"Numpy running time: %f\" % (np_running_time / np_repeat))\n", "\n", "\n", "def evaluate_addition(func, target, optimization, log):\n", " dev = tvm.device(target.kind.name, 0)\n", " n = 32768\n", " a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)\n", " b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)\n", " c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)\n", "\n", " evaluator = func.time_evaluator(func.entry_name, dev, number=10)\n", " mean_time = evaluator(a, b, c).mean\n", " print(\"%s: %f\" % (optimization, mean_time))\n", "\n", " log.append((optimization, mean_time))\n", "\n", "\n", "log = [(\"numpy\", np_running_time / np_repeat)]\n", "evaluate_addition(fadd, tgt, \"naive\", log=log)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 更新调度以使用并行\n", "\n", "现在已经说明了 TE 的基本原理,更深入地了解调度的作用,以及如何使用它们来为不同的架构调度张量表达式。调度是一系列应用于表达式的步骤,以多种不同的方式对其进行转换。当调度应用于 TE 中的表达式时,输入和输出保持不变,但在编译时,表达式的实现可以改变。在默认的调度中,这个张量加法是并行运行的,但是很容易在所有的处理器线程中进行并行化。可以将并行调度的操作应用到计算中。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s[C].parallel(C.op.axis[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`tvm.lower` 命令将生成 TE 的中间表示(IR),以及相应的调度。通过在应用不同的调度操作时降低表达式,可以看到调度对计算的顺序的影响。使用旗标 `simple_mode=True` 来返回一个可读的 C 风格语句。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type=\"auto\"),\n", " B: Buffer(B_2: Pointer(float32), float32, [(stride_1: int32*n)], [], type=\"auto\"),\n", " C: Buffer(C_2: Pointer(float32), float32, [(stride_2: int32*n)], [], type=\"auto\")}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n], [stride], type=\"auto\"), B_1: B_3: Buffer(B_2, float32, [n], [stride_1], type=\"auto\"), C_1: C_3: Buffer(C_2, float32, [n], [stride_2], type=\"auto\")} {\n", " for (i: int32, 0, n) \"parallel\" {\n", " C[(i*stride_2)] = (A[(i*stride)] + B[(i*stride_1)])\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "现在,TVM 
有可能在独立的线程上运行这些块。编译并运行这个应用了并行操作的新调度。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "parallel: 0.000022\n" ] } ], "source": [ "fadd_parallel = tvm.build(s, [A, B, C], tgt, name=\"myadd_parallel\")\n", "fadd_parallel(a, b, c)\n", "\n", "tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())\n", "\n", "evaluate_addition(fadd_parallel, tgt, \"parallel\", log=log)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 更新调度以使用矢量化\n", "\n", "现代的 CPU 也有能力对浮点值进行 SIMD 操作,我们可以对我们的计算表达式应用另一个调度,以利用这一优势。实现这一点需要多个步骤:首先,我们必须使用分割调度原语将调度分割成内循环和外循环。内循环可以使用矢量化调度原语来使用 SIMD 指令,然后外循环可以使用并行调度原语来并行化。选择分割因子为你的 CPU 上的线程数。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vector: 0.000017\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type=\"auto\"),\n", " B: Buffer(B_2: Pointer(float32), float32, [(stride_1: int32*n)], [], type=\"auto\"),\n", " C: Buffer(C_2: Pointer(float32), float32, [(stride_2: int32*n)], [], type=\"auto\")}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [n], [stride], type=\"auto\"), B_1: B_3: Buffer(B_2, float32, [n], [stride_1], type=\"auto\"), C_1: C_3: Buffer(C_2, float32, [n], [stride_2], type=\"auto\")} {\n", " for (i.outer: int32, 0, floordiv((n + 3), 4)) \"parallel\" {\n", " for (i.inner.s: int32, 0, 4) {\n", " if @tir.likely((((i.outer*4) + i.inner.s) < n), dtype=bool) {\n", " let cse_var_1: int32 = ((i.outer*4) + i.inner.s)\n", " C[(cse_var_1*stride_2)] = (A[(cse_var_1*stride)] + B[(cse_var_1*stride_1)])\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# 重新创建调度,因为我们用并行操作修改了它\n", "n = te.var(\"n\")\n", "A = te.placeholder((n,), name=\"A\")\n", "B = te.placeholder((n,), name=\"B\")\n", "C = te.compute(A.shape, lambda i: A[i] + B[i], name=\"C\")\n", "\n", "s = te.create_schedule(C.op)\n", "\n", "# 这个 factor 应该被选择来匹配适合你的 CPU 的线程数。\n", "# 这将根据架构的不同而变化,\n", "# 但一个好的规则是将这个 factor 设置为等于可用的 CPU 内核数。\n", "factor = 4\n", "\n", "outer, inner = s[C].split(C.op.axis[0], factor=factor)\n", "s[C].parallel(outer)\n", "s[C].vectorize(inner)\n", "\n", "fadd_vector = tvm.build(s, [A, B, C], tgt, name=\"myadd_parallel\")\n", "\n", "evaluate_addition(fadd_vector, tgt, \"vector\", log=log)\n", "\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 比较不同的调度\n", "\n", "我们现在可以比较不同的调度:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Operator\t Timing\t Performance\n", " numpy\t1.5522249741479753e-05\t 1.0\n", " naive\t1.3208600000000001e-05\t 0.8509462365305823\n", " parallel\t 2.17076e-05\t 1.3984828463358165\n", " vector\t1.6507899999999998e-05\t 1.0634991882578924\n" ] } ], "source": [ "baseline = log[0][1]\n", "print(\"%s\\t%s\\t%s\" % (\"Operator\".rjust(20), \"Timing\".rjust(20), \"Performance\".rjust(20)))\n", "for result in log:\n", " print(\n", " \"%s\\t%s\\t%s\"\n", " % (result[0].rjust(20), str(result[1]).rjust(20), str(result[1] / baseline).rjust(20))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "```{admonition} 代码特殊化\n", "正如你可能已经注意到的,`A`、`B` 和 `C` 的声明都采取了相同的形状参数 `n`。TVM 将利用这一点,只向内核传递一个形状参数,正如你在打印的设备代码中发现的那样。这是专业化的一种形式。\n", "\n", "在主机端,TVM 会自动生成检查代码,检查参数中的约束。所以如果你把不同形状的数组传入 `fadd`,就会出现错误。\n", "\n", "我们可以做更多的特殊化。例如,我们可以在计算声明中写 `n = tvm.runtime.convert(1024)`,而不是 `n = te.var(\"n\")` 。生成的函数将只接受长度为 1024 的向量。\n", "```\n", "\n", "我们已经定义、调度并编译了向量加法运算符,然后我们能够在 TVM 运行时上执行它。我们可以将运算符保存为一个库,然后我们可以在以后使用 TVM 运行时加载它。\n", "\n", "### 针对 GPU 的向量加法(可选)\n", "\n", "TVM 能够针对多种架构。在下一个例子中,将针对 GPU 的向量加法进行编译。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "-----GPU code-----\n", "\n", "#ifdef _WIN32\n", " using uint = unsigned int;\n", " using uchar = unsigned char;\n", " using ushort = unsigned short;\n", " using int64_t = long long;\n", " using uint64_t = unsigned long long;\n", "#else\n", " #define uint unsigned int\n", " #define uchar unsigned char\n", " #define ushort unsigned short\n", " #define int64_t long long\n", " #define uint64_t unsigned long long\n", "#endif\n", "extern \"C\" __global__ void __launch_bounds__(64) myadd_kernel0(float* __restrict__ C, float* __restrict__ A, float* __restrict__ B, int n, int stride, int stride1, int stride2) {\n", " if (((int)blockIdx.x) < (n >> 6)) {\n", " C[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride)] = (A[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride1)] + B[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride2)]);\n", " } else {\n", " if (((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) < n) {\n", " C[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride)] = (A[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride1)] + B[(((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) * stride2)]);\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# 如果你想运行这段代码,改变 ``run_cuda = True``。\n", "# 注意,在默认情况下,这个例子不会在 docs CI 中运行。\n", "\n", "run_cuda = True\n", "if run_cuda:\n", " # Change this target to the correct backend for you gpu. For example: cuda (NVIDIA GPUs),\n", " # rocm (Radeon GPUS), OpenCL (opencl).\n", " tgt_gpu = tvm.target.Target(target=\"cuda\", host=\"llvm\")\n", "\n", " # Recreate the schedule\n", " n = te.var(\"n\")\n", " A = te.placeholder((n,), name=\"A\")\n", " B = te.placeholder((n,), name=\"B\")\n", " C = te.compute(A.shape, lambda i: A[i] + B[i], name=\"C\")\n", " print(type(C))\n", "\n", " s = te.create_schedule(C.op)\n", "\n", " bx, tx = s[C].split(C.op.axis[0], factor=64)\n", "\n", " ################################################################################\n", " # Finally we must bind the iteration axis bx and tx to threads in the GPU\n", " # compute grid. The naive schedule is not valid for GPUs, and these are\n", " # specific constructs that allow us to generate code that runs on a GPU.\n", "\n", " s[C].bind(bx, te.thread_axis(\"blockIdx.x\"))\n", " s[C].bind(tx, te.thread_axis(\"threadIdx.x\"))\n", "\n", " ######################################################################\n", " # Compilation\n", " # -----------\n", " # After we have finished specifying the schedule, we can compile it\n", " # into a TVM function. 
By default TVM compiles into a type-erased\n", " # function that can be directly called from the python side.\n", " #\n", " # In the following line, we use tvm.build to create a function.\n", " # The build function takes the schedule, the desired signature of the\n", " # function (including the inputs and outputs) as well as target language\n", " # we want to compile to.\n", " #\n", " # The result of compilation fadd is a GPU device function (if GPU is\n", " # involved) as well as a host wrapper that calls into the GPU\n", " # function. fadd is the generated host wrapper function, it contains\n", " # a reference to the generated device function internally.\n", "\n", " fadd = tvm.build(s, [A, B, C], target=tgt_gpu, name=\"myadd\")\n", "\n", " ################################################################################\n", " # The compiled TVM function exposes a concise C API that can be invoked from\n", " # any language.\n", " #\n", " # We provide a minimal array API in python to aid quick testing and prototyping.\n", " # The array API is based on the `DLPack `_ standard.\n", " #\n", " # - We first create a GPU device.\n", " # - Then tvm.nd.array copies the data to the GPU.\n", " # - ``fadd`` runs the actual computation\n", " # - ``numpy()`` copies the GPU array back to the CPU (so we can verify correctness).\n", " #\n", " # Note that copying the data to and from the memory on the GPU is a required step.\n", "\n", " dev = tvm.device(tgt_gpu.kind.name, 0)\n", "\n", " n = 1024\n", " a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)\n", " b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)\n", " c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)\n", " fadd(a, b, c)\n", " tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())\n", "\n", " ################################################################################\n", " # Inspect the Generated GPU Code\n", " # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", " # You can inspect the generated code in TVM. The result of tvm.build is a TVM\n", " # Module. 
fadd is the host module that contains the host wrapper, it also\n", " # contains a device module for the CUDA (GPU) function.\n", " #\n", " # The following code fetches the device module and prints the content code.\n", "\n", " if (\n", " tgt_gpu.kind.name == \"cuda\"\n", " or tgt_gpu.kind.name == \"rocm\"\n", " or tgt_gpu.kind.name.startswith(\"opencl\")\n", " ):\n", " dev_module = fadd.imported_modules[0]\n", " print(\"-----GPU code-----\")\n", " print(dev_module.get_source())\n", " else:\n", " print(fadd.get_source())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 保存和加载已编译的模块\n", "\n", "除了运行时编译,我们还可以将编译后的模块保存到文件中,以后再加载回来。\n", "\n", "下面的代码首先执行了以下步骤:\n", "\n", "- 它将编译后的主机模块保存到一个对象文件中。\n", "- 然后它将设备模块保存到 ptx 文件中。\n", "- `cc.create_shared` 调用编译器(gcc)来创建共享库" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['myadd.so', 'myadd.o']\n" ] } ], "source": [ "from tvm.contrib import cc\n", "from tvm.contrib import utils\n", "\n", "temp = utils.tempdir()\n", "fadd.save(temp.relpath(\"myadd.o\"))\n", "if tgt.kind.name == \"cuda\":\n", " fadd.imported_modules[0].save(temp.relpath(\"myadd.ptx\"))\n", "if tgt.kind.name == \"rocm\":\n", " fadd.imported_modules[0].save(temp.relpath(\"myadd.hsaco\"))\n", "if tgt.kind.name.startswith(\"opencl\"):\n", " fadd.imported_modules[0].save(temp.relpath(\"myadd.cl\"))\n", "cc.create_shared(temp.relpath(\"myadd.so\"), [temp.relpath(\"myadd.o\")])\n", "print(temp.listdir())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} 模块存储格式\n", "CPU(主机)模块被直接保存为共享库(`.so`)。设备代码可以有多种自定义格式。在我们的例子中,设备代码被保存在 ptx 中,还有一个元数据 json 文件。它们可以通过导入分离加载和链接。\n", "```\n", "\n", "### 加载已编译的模块\n", "\n", "我们可以从文件系统中加载编译好的模块并运行代码。下面的代码分别加载主机和设备模块,并将它们链接在一起。我们可以验证新加载的功能是否工作。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fadd1 = tvm.runtime.load_module(temp.relpath(\"myadd.so\"))\n", "if tgt.kind.name == \"cuda\":\n", " fadd1_dev = tvm.runtime.load_module(temp.relpath(\"myadd.ptx\"))\n", " fadd1.import_module(fadd1_dev)\n", "\n", "if tgt.kind.name == \"rocm\":\n", " fadd1_dev = tvm.runtime.load_module(temp.relpath(\"myadd.hsaco\"))\n", " fadd1.import_module(fadd1_dev)\n", "\n", "if tgt.kind.name.startswith(\"opencl\"):\n", " fadd1_dev = tvm.runtime.load_module(temp.relpath(\"myadd.cl\"))\n", " fadd1.import_module(fadd1_dev)\n", "\n", "fadd1(a, b, c)\n", "tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 把所有东西都装进库\n", "\n", "在上面的例子中,分别存储了设备和主机代码。TVM 也支持将所有东西作为共享库导出。在底层,将设备模块打包成二进制的 blob,并将它们与主机代码连接在一起。目前支持打包 Metal、OpenCL 和 CUDA 模块。" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fadd.export_library(temp.relpath(\"myadd_pack.so\"))\n", "fadd2 = tvm.runtime.load_module(temp.relpath(\"myadd_pack.so\"))\n", "fadd2(a, b, c)\n", "tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} 运行时 API 和线程安全\n", "TVM 的编译模块并不依赖于 TVM 编译器。相反,它们只依赖于一个最小的运行时库。TVM 运行库包装了设备驱动程序,并提供线程安全和设备无关的调用到编译的函数。\n", "\n", "这意味着你可以从任何线程、任何 GPU 上调用已编译的 TVM 函数,只要你已经为该 GPU 编译了代码。\n", "```\n", "\n", "## 生成 OpenCL 代码\n", "\n", "TVM 提供代码生成功能到多个后端。我们还可以生成 OpenCL 代码或 LLVM 代码,在 CPU 后端运行。\n", "\n", "下面的代码块生成 OpenCL 代码,在 OpenCL 设备上创建阵列,并验证代码的正确性。" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "collapsed": false }, "outputs": [], "source": [ "if tgt.kind.name.startswith(\"opencl\"):\n", " fadd_cl = tvm.build(s, [A, B, C], tgt, name=\"myadd\")\n", " print(\"------opencl code------\")\n", " print(fadd_cl.imported_modules[0].get_source())\n", " dev = tvm.cl(0)\n", " n = 1024\n", " a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)\n", " b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)\n", " c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)\n", " fadd_cl(a, b, c)\n", " tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} TE 调度原语\n", "TVM 包括一些不同的调度原语:\n", "\n", "- `split`:将一个指定的轴按定义的因子(factor)分成两个轴。\n", "- `tile`:将一个计算按定义的 factor 分成两个轴。\n", "- `fuse`:融合一个计算的两个连续轴。\n", "- `reorder`:可以将一个计算的轴重新排序到一个定义的顺序。\n", "- `bind`:可以将一个计算绑定到一个特定的线程,在GPU编程中很有用。\n", "- `compute_at`:默认情况下,TVM 会在函数的最外层计算张量,也就是默认的根。`compute_at` 指定一个张量应该在另一个运算符的第一个计算轴上计算。\n", "- `compute_inline`:当标记为内联时,一个计算将被展开,然后插入到需要张量的地址中。\n", "- `compute_root`:将一个计算移到函数的最外层,或根部。这意味着该阶段的计算将在进入下一阶段之前被完全计算。\n", "\n", "这些原语的完整描述可以在 [调度原语](schedule_primitives) 文档页中找到。\n", "```\n", "\n", "## 实例2:用 TE 手动优化矩阵乘法\n", "\n", "现在我们将考虑第二个更高级的例子,演示仅用 18 行 python 代码,TVM 如何将一个普通的矩阵乘法操作加快 18 倍。\n", "\n", "矩阵乘法是计算密集型运算。为了获得良好的 CPU 性能,有两个重要的优化措施:\n", "\n", "1. 提高内存访问的高速缓存命中率。复杂的数值计算和热点内存(hot-spot memory)访问都可以通过高缓存命中率(high cache hit rate)来加速。这就要求我们将原点内存(origin )访问模式转化为符合高速缓存策略的模式。\n", "2. SIMD(单指令多数据),也被称为矢量处理单元。在每个周期中,SIMD 可以处理一小批数据,而不是处理一个单一的值。这就要求我们将循环体中的数据访问模式转化为统一模式,以便 LLVM 后端可以将其降低到 SIMD。\n", "\n", "本教程中使用的技术是 [资源库](https://github.com/flame/how-to-optimize-gemm) 中提到的技巧的一个子集。其中一些已经被 TVM 抽象自动应用了,但由于 TVM 的限制,其中一些不能自动应用。\n", "\n", "### 准备和性能基线\n", "\n", "我们首先收集 `numpy` 实现矩阵乘法的性能数据。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy running time: 0.023805\n" ] } ], "source": [ "import tvm\n", "import tvm.testing\n", "from tvm import te\n", "import numpy\n", "\n", "# The size of the matrix\n", "# (M, K) x (K, N)\n", "# You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.\n", "M = 1024\n", "K = 1024\n", "N = 1024\n", "\n", "# The default tensor data type in tvm\n", "dtype = \"float32\"\n", "\n", "# You will want to adjust the target to match any CPU vector extensions you\n", "# might have. 
For example, if you're using using Intel AVX2 (Advanced Vector\n", "# Extensions) ISA for SIMD, you can get the best performance by changing the\n", "# following line to ``llvm -mcpu=core-avx2``, or specific type of CPU you use.\n", "# Recall that you're using llvm, you can get this information from the command\n", "# ``llc --version`` to get the CPU type, and you can check ``/proc/cpuinfo``\n", "# for additional extensions that your processor might support.\n", "\n", "target = tvm.target.Target(target=\"llvm\", host=\"llvm\")\n", "dev = tvm.device(target.kind.name, 0)\n", "\n", "# Random generated tensor for testing\n", "a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev)\n", "b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev)\n", "\n", "# Repeatedly perform a matrix multiplication to get a performance baseline\n", "# for the default numpy implementation\n", "np_repeat = 100\n", "np_running_time = timeit.timeit(\n", " setup=\"import numpy\\n\"\n", " \"M = \" + str(M) + \"\\n\"\n", " \"K = \" + str(K) + \"\\n\"\n", " \"N = \" + str(N) + \"\\n\"\n", " 'dtype = \"float32\"\\n'\n", " \"a = numpy.random.rand(M, K).astype(dtype)\\n\"\n", " \"b = numpy.random.rand(K, N).astype(dtype)\\n\",\n", " stmt=\"answer = numpy.dot(a, b)\",\n", " number=np_repeat,\n", ")\n", "print(\"Numpy running time: %f\" % (np_running_time / np_repeat))\n", "\n", "answer = numpy.dot(a.numpy(), b.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "现在用 TVM TE 编写基本的矩阵乘法,并验证它产生的结果与 `numpy` 的实现相同。我们还写了一个函数,它将帮助衡量调度优化的性能。" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "none: 2.233156\n" ] } ], "source": [ "# TVM Matrix Multiplication using TE\n", "k = te.reduce_axis((0, K), \"k\")\n", "A = te.placeholder((M, K), name=\"A\")\n", "B = te.placeholder((K, N), name=\"B\")\n", "C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name=\"C\")\n", "\n", "# Default schedule\n", "s = te.create_schedule(C.op)\n", "func = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\n", "\n", "c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\n", "func(a, b, c)\n", "tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n", "\n", "\n", "def evaluate_operation(s, vars, target, name, optimization, log):\n", " func = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\n", " assert func\n", "\n", " c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\n", " func(a, b, c)\n", " tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n", "\n", " evaluator = func.time_evaluator(func.entry_name, dev, number=10)\n", " mean_time = evaluator(a, b, c).mean\n", " print(\"%s: %f\" % (optimization, mean_time))\n", " log.append((optimization, mean_time))\n", "\n", "\n", "log = []\n", "\n", "evaluate_operation(s, [A, B, C], target=target, name=\"mmult\", optimization=\"none\", log=log)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "让我们来看看使用 TVM 低级函数的运算器和默认调度的中间表示。请注意这个实现基本上是矩阵乘法的天真实现,在 A 和 B 矩阵的索引上使用三个嵌套循环。\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: 
Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " for (x: int32, 0, 1024) {\n", " for (y: int32, 0, 1024) {\n", " C[((x*1024) + y)] = 0f32\n", " for (k: int32, 0, 1024) {\n", " let cse_var_2: int32 = (x*1024)\n", " let cse_var_1: int32 = (cse_var_2 + y)\n", " C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + k)]*B[((k*1024) + y)]))\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化1:阻塞\n", "\n", "提高缓冲区命中率的一个重要技巧是阻塞,在这个过程中,你的内存访问结构是在一个块的内部有一个小的邻域,具有很高的内存定位性。在本教程中,我们选择一个 32 的块因子。这将导致一个块充满 32 * 32 * sizeof(float) 的内存区域。这相当于一个 4KB 的缓存大小,而 L1 缓存的参考缓存大小为 32KB。\n", "\n", "我们首先为 ``C`` 操作创建一个默认的调度,然后用指定的块因子对其应用一个 `tile` 调度原语,调度原语返回所产生的循环顺序,从最外层到最内层,作为一个向量 `[x_outer, y_outer, x_inner, y_inner]`。然后我们得到操作输出的还原轴,并使用 4 的因子对其进行分割操作。这个因子并不直接影响我们现在正在进行的阻塞优化,但在以后我们应用矢量化时将会很有用。\n", "\n", "现在操作已经被阻塞了,我们可以重新调度计算的顺序,把减少操作放到计算的最外层循环中,帮助保证被阻止的数据仍然在缓存中。这样就完成了调度,我们可以建立并测试与原生的调度相比的性能。" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "blocking: 0.249289\n" ] } ], "source": [ "bn = 32\n", "\n", "# Blocking by loop tiling\n", "xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n", "(k,) = s[C].op.reduce_axis\n", "ko, ki = s[C].split(k, factor=4)\n", "\n", "# Hoist reduction domain outside the blocking loop\n", "s[C].reorder(xo, yo, ko, ki, xi, yi)\n", "\n", "evaluate_operation(s, [A, B, C], target=target, name=\"mmult\", optimization=\"blocking\", log=log)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "通过重新安排计算顺序以利用缓存,你应该看到计算的性能有了明显的改善。现在,打印内部表示,并将其与原始表示进行比较。" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " for (x.outer: int32, 0, 32) {\n", " for (y.outer: int32, 0, 32) {\n", " for (x.inner.init: int32, 0, 32) {\n", " for (y.inner.init: int32, 0, 32) {\n", " C[((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)) + y.inner.init)] = 0f32\n", " }\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (k.inner: int32, 0, 4) {\n", " for (x.inner: int32, 0, 32) {\n", " for (y.inner: int32, 0, 32) {\n", " let cse_var_3: int32 = (y.outer*32)\n", " let cse_var_2: int32 = ((x.outer*32768) + (x.inner*1024))\n", " let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + y.inner)\n", " C[cse_var_1] = (C[cse_var_1] + (A[((cse_var_2 + (k.outer*4)) + k.inner)]*B[((((k.outer*4096) + (k.inner*1024)) + cse_var_3) + y.inner)]))\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "print(tvm.lower(s, [A, B, C], 
simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化 2: 矢量化\n", "\n", "另一个重要的优化技巧是矢量化。当内存访问模式是统一的,编译器可以检测到这种模式并将连续的内存传递给 SIMD 矢量处理器。在 TVM 中,我们可以使用 ``vectorize`` 接口来提示编译器这种模式,利用这一硬件特性。\n", "\n", "在本教程中,我们选择对内循环的行数据进行矢量化,因为在我们之前的优化中,它已经是缓存友好的。" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vectorization: 0.257297\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " for (x.outer: int32, 0, 32) {\n", " for (y.outer: int32, 0, 32) {\n", " for (x.inner.init: int32, 0, 32) {\n", " C[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (k.inner: int32, 0, 4) {\n", " for (x.inner: int32, 0, 32) {\n", " let cse_var_3: int32 = (y.outer*32)\n", " let cse_var_2: int32 = ((x.outer*32768) + (x.inner*1024))\n", " let cse_var_1: int32 = (cse_var_2 + cse_var_3)\n", " C[ramp(cse_var_1, 1, 32)] = (C[ramp(cse_var_1, 1, 32)] + (broadcast(A[((cse_var_2 + (k.outer*4)) + k.inner)], 32)*B[ramp((((k.outer*4096) + (k.inner*1024)) + cse_var_3), 1, 32)]))\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# Apply the vectorization optimization\n", "s[C].vectorize(yi)\n", "\n", "evaluate_operation(s, [A, B, C], target=target, name=\"mmult\", optimization=\"vectorization\", log=log)\n", "\n", "# The generalized IR after vectorization\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化3:循环交换\n", "\n", "如果我们看一下上面的 IR,我们可以看到内循环的行数据被矢量化,B 被转化为 PackedB(这从内循环的 `(float32x32*)B2` 部分可以看出)。现在 PackedB 的遍历是顺序的。所以我们要看一下 A 的访问模式。在当前的计划中,A 是被逐列访问的,这对缓冲区不友好。如果我们改变 `ki` 和内轴 `xi` 的嵌套循环顺序,A 矩阵的访问模式将对缓存更友好。" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loop permutation: 0.130006\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " for (x.outer: int32, 0, 32) {\n", " for (y.outer: int32, 0, 32) {\n", " for (x.inner.init: int32, 0, 32) {\n", " C[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (x.inner: int32, 0, 32) {\n", " for (k.inner: int32, 0, 4) {\n", " let cse_var_3: int32 = (y.outer*32)\n", " let cse_var_2: int32 
= ((x.outer*32768) + (x.inner*1024))\n", " let cse_var_1: int32 = (cse_var_2 + cse_var_3)\n", " C[ramp(cse_var_1, 1, 32)] = (C[ramp(cse_var_1, 1, 32)] + (broadcast(A[((cse_var_2 + (k.outer*4)) + k.inner)], 32)*B[ramp((((k.outer*4096) + (k.inner*1024)) + cse_var_3), 1, 32)]))\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "s = te.create_schedule(C.op)\n", "xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n", "(k,) = s[C].op.reduce_axis\n", "ko, ki = s[C].split(k, factor=4)\n", "\n", "# re-ordering\n", "s[C].reorder(xo, yo, ko, xi, ki, yi)\n", "s[C].vectorize(yi)\n", "\n", "evaluate_operation(\n", " s, [A, B, C], target=target, name=\"mmult\", optimization=\"loop permutation\", log=log\n", ")\n", "\n", "# Again, print the new generalized IR\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化4:数组打包\n", "\n", "另一个重要的技巧是数组打包。这个技巧是对数组的存储维度进行重新排序,将某些维度上的连续访问模式转换为扁平化后的顺序模式。\n", "\n", "![](images/array-packing.png)\n", "\n", "正如上图所示,在阻塞计算后,我们可以观察到 B 的数组访问模式(扁平化后),它是有规律的,但是不连续的。我们期望经过一些转换后,我们可以得到一个连续的访问模式。通过将 `[16][16]` 数组重新排序为 `[16/4][16][4]` 数组,当从打包的数组中抓取相应的值时,B 的访问模式将是连续的。\n", "\n", "为了实现这一目标,将不得不从新的默认调度开始,考虑到 B 的新包装,值得花点时间来评论一下。TE 是编写优化运算符的强大而富有表现力的语言,但它往往需要对你所编写的底层算法、数据结构和硬件目标有一些了解。在本教程的后面,我们将讨论一些让 TVM 承担这一负担的选项。无论如何,让我们继续编写新的优化调度。" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "array packing: 0.140257\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {\n", " for (x: int32, 0, 32) \"parallel\" {\n", " for (y: int32, 0, 1024) {\n", " packedB_1: Buffer(packedB, float32x32, [32768], [])[((x*1024) + y)] = B[ramp(((y*1024) + (x*32)), 1, 32)]\n", " }\n", " }\n", " for (x.outer: int32, 0, 32) {\n", " for (y.outer: int32, 0, 32) {\n", " for (x.inner.init: int32, 0, 32) {\n", " C[ramp((((x.outer*32768) + (x.inner.init*1024)) + (y.outer*32)), 1, 32)] = broadcast(0f32, 32)\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (x.inner: int32, 0, 32) {\n", " for (k.inner: int32, 0, 4) {\n", " let cse_var_3: int32 = ((x.outer*32768) + (x.inner*1024))\n", " let cse_var_2: int32 = (k.outer*4)\n", " let cse_var_1: int32 = (cse_var_3 + (y.outer*32))\n", " C[ramp(cse_var_1, 1, 32)] = (C[ramp(cse_var_1, 1, 32)] + (broadcast(A[((cse_var_3 + cse_var_2) + k.inner)], 32)*packedB_1[(((y.outer*1024) + cse_var_2) + k.inner)]))\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# 必须重写算法。\n", "packedB = te.compute((N / bn, K, bn), lambda x, y, z: B[y, x * bn + z], name=\"packedB\")\n", "C = te.compute(\n", " (M, N),\n", " lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),\n", " name=\"C\",\n", ")\n", "\n", "s = te.create_schedule(C.op)\n", "\n", "xo, yo, xi, yi = s[C].tile(C.op.axis[0], 
C.op.axis[1], bn, bn)\n", "(k,) = s[C].op.reduce_axis\n", "ko, ki = s[C].split(k, factor=4)\n", "\n", "s[C].reorder(xo, yo, ko, xi, ki, yi)\n", "s[C].vectorize(yi)\n", "\n", "x, y, z = s[packedB].op.axis\n", "s[packedB].vectorize(z)\n", "s[packedB].parallel(x)\n", "\n", "evaluate_operation(s, [A, B, C], target=target, name=\"mmult\", optimization=\"array packing\", log=log)\n", "\n", "# Here is the generated IR after array packing.\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化 5:通过缓存优化块的写入\n", "\n", "到目前为止,我们所有的优化都集中在有效地访问和计算 `A` 和 `B` 矩阵的数据以计算 `C` 矩阵上。在阻塞优化之后,运算器将逐块地将结果写入 `C`,而且访问模式不是顺序的。我们可以通过使用一个顺序缓存数组来解决这个问题,使用 `cache_write`、`compute_at` 和 `unroll` 的组合来保存块结果,并在所有块结果准备好后写入 `C`。" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "block caching: 0.131626\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global;\n", " allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {\n", " for (x: int32, 0, 32) \"parallel\" {\n", " for (y: int32, 0, 1024) {\n", " packedB_1: Buffer(packedB, float32x32, [32768], [])[((x*1024) + y)] = B[ramp(((y*1024) + (x*32)), 1, 32)]\n", " }\n", " }\n", " for (x.outer: int32, 0, 32) {\n", " for (y.outer: int32, 0, 32) {\n", " for (x.c.init: int32, 0, 32) {\n", " C.global_1: Buffer(C.global, float32, [1024], [])[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (x.c: int32, 0, 32) {\n", " let cse_var_4: int32 = (k.outer*4)\n", " let cse_var_3: int32 = (x.c*32)\n", " let cse_var_2: int32 = ((y.outer*1024) + cse_var_4)\n", " let cse_var_1: int32 = (((x.outer*32768) + (x.c*1024)) + cse_var_4)\n", " {\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[cse_var_1], 32)*packedB_1[cse_var_2]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 1)], 32)*packedB_1[(cse_var_2 + 1)]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 2)], 32)*packedB_1[(cse_var_2 + 2)]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 3)], 32)*packedB_1[(cse_var_2 + 3)]))\n", " }\n", " }\n", " }\n", " for (x.inner: int32, 0, 32) {\n", " for (y.inner: int32, 0, 32) {\n", " C[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = C.global_1[((x.inner*32) + y.inner)]\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "s = te.create_schedule(C.op)\n", "\n", "# Allocate write cache\n", "CC = s.cache_write(C, \"global\")\n", "\n", "xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n", "\n", "# Write cache is computed at yo\n", 
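"# (说明:上面的 cache_write 为 C 分配了名为 CC 的全局写缓存;下面的 compute_at\n", "#  把 CC 的计算移动到 C 的 y.outer 循环内,让每个 32x32 的块先在连续的缓冲区中累加,\n", "#  待该块的归约完成后再一次性写回 C,从而获得顺序的写入模式。)\n",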
"s[CC].compute_at(s[C], yo)\n", "\n", "# New inner axes\n", "xc, yc = s[CC].op.axis\n", "\n", "(k,) = s[CC].op.reduce_axis\n", "ko, ki = s[CC].split(k, factor=4)\n", "s[CC].reorder(ko, xc, ki, yc)\n", "s[CC].unroll(ki)\n", "s[CC].vectorize(yc)\n", "\n", "x, y, z = s[packedB].op.axis\n", "s[packedB].vectorize(z)\n", "s[packedB].parallel(x)\n", "\n", "evaluate_operation(s, [A, B, C], target=target, name=\"mmult\", optimization=\"block caching\", log=log)\n", "\n", "# Here is the generated IR after write cache blocking.\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 优化6:并行化\n", "\n", "到目前为止,我们的计算只被设计为使用单核。几乎所有的现代处理器都有多个内核,计算可以从并行运行的计算中获益。最后的优化是利用线程级并行化的优势。" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "parallelization: 0.026403\n", "@main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()\n", " attr = {\"from_legacy_te_schedule\": True, \"global_symbol\": \"main\", \"tir.noalias\": True}\n", " buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),\n", " B: Buffer(B_2: Pointer(float32), float32, [1048576], []),\n", " C: Buffer(C_2: Pointer(float32), float32, [1048576], [])}\n", " buffer_map = {A_1: A, B_1: B, C_1: C}\n", " preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 1024], []), B_1: B_3: Buffer(B_2, float32, [1024, 1024], []), C_1: C_3: Buffer(C_2, float32, [1024, 1024], [])} {\n", " allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {\n", " for (x: int32, 0, 32) \"parallel\" {\n", " for (y: int32, 0, 1024) {\n", " packedB_1: Buffer(packedB, float32x32, [32768], [])[((x*1024) + y)] = B[ramp(((y*1024) + (x*32)), 1, 32)]\n", " }\n", " }\n", " for (x.outer: int32, 0, 32) \"parallel\" {\n", " allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global;\n", " for (y.outer: int32, 0, 32) {\n", " for (x.c.init: int32, 0, 32) {\n", " C.global_1: Buffer(C.global, float32, [1024], [])[ramp((x.c.init*32), 1, 32)] = broadcast(0f32, 32)\n", " }\n", " for (k.outer: int32, 0, 256) {\n", " for (x.c: int32, 0, 32) {\n", " let cse_var_4: int32 = (k.outer*4)\n", " let cse_var_3: int32 = (x.c*32)\n", " let cse_var_2: int32 = ((y.outer*1024) + cse_var_4)\n", " let cse_var_1: int32 = (((x.outer*32768) + (x.c*1024)) + cse_var_4)\n", " {\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[cse_var_1], 32)*packedB_1[cse_var_2]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 1)], 32)*packedB_1[(cse_var_2 + 1)]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 2)], 32)*packedB_1[(cse_var_2 + 2)]))\n", " C.global_1[ramp(cse_var_3, 1, 32)] = (C.global_1[ramp(cse_var_3, 1, 32)] + (broadcast(A[(cse_var_1 + 3)], 32)*packedB_1[(cse_var_2 + 3)]))\n", " }\n", " }\n", " }\n", " for (x.inner: int32, 0, 32) {\n", " for (y.inner: int32, 0, 32) {\n", " C[((((x.outer*32768) + (x.inner*1024)) + (y.outer*32)) + y.inner)] = C.global_1[((x.inner*32) + y.inner)]\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "\n" ] } ], "source": [ "# parallel\n", "s[C].parallel(xo)\n", "\n", "x, y, z = s[packedB].op.axis\n", "s[packedB].vectorize(z)\n", "s[packedB].parallel(x)\n", "\n", "evaluate_operation(\n", " s, [A, B, C], target=target, name=\"mmult\", optimization=\"parallelization\", log=log\n", 
")\n", "\n", "# Here is the generated IR after parallelization.\n", "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 矩阵乘法实例总结\n", "\n", "在应用了上述仅有 18 行代码的简单优化后,我们生成的代码可以开始接近 `numpy` 与 Math Kernel Library(MKL)的性能。由于我们在工作中一直在记录性能,所以我们可以比较一下结果。" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Operator\t Timing\t Performance\n", " none\t 2.233155965\t 1.0\n", " blocking\t 0.2492886151\t 0.11163063351018566\n", " vectorization\t 0.257297356\t 0.1152169217164373\n", " loop permutation\t 0.1300062278\t0.058216367256731225\n", " array packing\t 0.140256607\t 0.06280645382509144\n", " block caching\t 0.13162604679999998\t 0.05894171695258194\n", " parallelization\t 0.026402827\t0.011823100318029064\n" ] } ], "source": [ "baseline = log[0][1]\n", "print(\"%s\\t%s\\t%s\" % (\"Operator\".rjust(20), \"Timing\".rjust(20), \"Performance\".rjust(20)))\n", "for result in log:\n", " print(\n", " \"%s\\t%s\\t%s\"\n", " % (result[0].rjust(20), str(result[1]).rjust(20), str(result[1] / baseline).rjust(20))\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "请注意,网页上的输出反映了在一个非独家 Docker 容器上的运行时间,应该被认为是不可靠的。强烈建议你自己运行该教程,观察 TVM 取得的性能提升,并仔细研究每个例子,了解对矩阵乘法操作的迭代改进。\n", "\n", "## 最后说明和总结\n", "\n", "如前所述,如何使用 TE 和调度原语进行优化,可能需要对底层架构和算法有一些了解。然而,TE 的设计是作为更复杂的算法的基础,可以搜索潜在的优化。有了这篇关于 TE 的介绍中的知识,我们现在可以开始探索 TVM 如何将调度优化过程自动化。\n", "\n", "本教程提供了一个 TVM 张量表达(TE)工作流程的演练,使用了一个矢量添加和一个矩阵乘法的例子。一般的工作流程是:\n", "\n", "- 通过一系列的操作来描述你的计算。\n", "- 描述我们要如何计算使用调度原语。\n", "- 编译到我们想要的目标函数。\n", "- 可以选择保存该函数以便以后加载。\n", "\n", "即将出版的教程扩展了矩阵乘法的例子,并展示了你如何建立矩阵乘法和其他操作的通用模板,这些模板具有可调整的参数,允许你为特定平台自动优化计算。\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('py38': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "28558e8daad512806f5c536a1a04c119185f99f65b79002708a12162d02a79c7" } } }, "nbformat": 4, "nbformat_minor": 0 }