{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# How to optimize GEMM on CPU\n**Author**: [Jian Weng](https://github.com/were), [Ruofei Yu](https://github.com/yuruofeifei)\n\n(TL;DR) TVM provides abstract interfaces which allows users to depict an algorithm and the\nalgorithm's implementing organization (the so-called schedule) separately. Typically, writing\nalgorithm in high-performance schedule breaks the algorithm's readability and modularity. Also,\ntrying various seemingly promising schedules is time-consuming. With the help of TVM, we can\ntry these schedules efficiently to enhance the performance.\n\nIn this tutorial, we will demonstrate how to use TVM to optimize square matrix multiplication\nand achieve 200 times faster than baseline by simply adding 18 extra lines of code.\n\nThere are two important optimizations on intense computation applications executed on CPU:\n 1. Increase the cache hit rate of memory access. Both complex numerical computation and hot-spot\n memory access can be accelerated from high cache hit rate. This requires us to transform the\n origin memory access pattern to the pattern fits the cache policy.\n 2. SIMD (Single instruction multi-data), or we call it vector processing unit. Every time, a\n small batch of data, rather than a single grid, will be processed. This requires us to\n transform the data access pattern in the loop body in uniform pattern so that the LLVM\n backend can lower it to SIMD.\n\nActually, all the methodologies used in this tutorial is a subset of tricks mentioned in this\n[repo](https://github.com/flame/how-to-optimize-gemm). Some of them have been applied by TVM\nabstraction automatically, but some of them cannot be simply applied due to TVM constraints.\n\nAll the experiment results mentioned below, are executed on 2015's 15' MacBook equipped with\nIntel i7-4770HQ CPU. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation and Baseline\nIn this tutorial, we will demonstrate how to use TVM to optimize matrix multiplication.\nBefore the actual demonstration, we first define these variables.\nThen we write a baseline implementation, the simplest way to write a matrix multiplication in TVM.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import tvm\nimport tvm.testing\nfrom tvm import te\nimport numpy\nimport timeit\n\n# The size of the matrix\n# (M, K) x (K, N)\n# You are free to try out different shapes; sometimes TVM optimization outperforms numpy with MKL.\nM = 1024\nK = 1024\nN = 1024\n\n# The default tensor type in tvm\ndtype = \"float32\"\n\n# using Intel AVX2 (Advanced Vector Extensions) ISA for SIMD\n# To get the best performance, please change the following line\n# to llvm -mcpu=core-avx2, or the specific type of CPU you use\ntarget = \"llvm\"\ndev = tvm.device(target, 0)\n\n# Randomly generated tensors for testing\na = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev)\nb = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev)\n\nnp_repeat = 100\nnp_running_time = timeit.timeit(\n    setup=\"import numpy\\n\"\n    \"M = \" + str(M) + \"\\n\"\n    \"K = \" + str(K) + \"\\n\"\n    \"N = \" + str(N) + \"\\n\"\n    'dtype = \"float32\"\\n'\n    \"a = numpy.random.rand(M, K).astype(dtype)\\n\"\n    \"b = numpy.random.rand(K, N).astype(dtype)\\n\",\n    stmt=\"answer = numpy.dot(a, b)\",\n    number=np_repeat,\n)\nprint(\"Numpy running time: %f\" % (np_running_time / np_repeat))\n\nanswer = numpy.dot(a.numpy(), b.numpy())\n\n# Algorithm\nk = te.reduce_axis((0, K), \"k\")\nA = te.placeholder((M, K), name=\"A\")\nB = te.placeholder((K, N), name=\"B\")\nC = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name=\"C\")\n\n# Default schedule\ns = te.create_schedule(C.op)\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=1)\nprint(\"Baseline: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In TVM, we can always inspect the lower-level IR to debug or optimize our schedule.\nHere is the generated IR using our baseline schedule.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Blocking\nAn important trick to enhance the cache hit rate is blocking: the data are computed block by\nblock, so that the memory accesses inside a block form a small neighbourhood with high memory\nlocality. In this tutorial, we pick 32 as the blocking factor, so a block fills\n32 * 32 * sizeof(float), which is 4KB, of the 32KB L1 data cache.\n\n" ] },
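{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the schedule below concrete, here is a tiny pure-Python sketch we added (with our own\nshrunken sizes) of the tiled loop nest we are about to ask TVM to generate: the block loops\nmo/no, the split reduction loops ko/ki, and the intra-block loops mi/ni. It only illustrates the\nloop structure; the fast kernel itself is produced by the TVM schedule that follows.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n# Shrunken sizes for this illustration only (the TVM schedule below uses M = N = K = 1024).\nsize, bn_demo, kfactor_demo = 8, 4, 2\nA_s = np.random.rand(size, size).astype(\"float32\")\nB_s = np.random.rand(size, size).astype(\"float32\")\nC_s = np.zeros((size, size), dtype=\"float32\")\n\n# Loop order (mo, no, ko, ki, mi, ni), mirroring s[C].reorder(mo, no, ko, ki, mi, ni) below.\nfor mo in range(size // bn_demo):\n    for no in range(size // bn_demo):\n        for ko in range(size // kfactor_demo):\n            for ki in range(kfactor_demo):\n                for mi in range(bn_demo):\n                    for ni in range(bn_demo):\n                        m = mo * bn_demo + mi\n                        n = no * bn_demo + ni\n                        k = ko * kfactor_demo + ki\n                        C_s[m, n] += A_s[m, k] * B_s[k, n]\n\nnp.testing.assert_allclose(C_s, A_s @ B_s, rtol=1e-5)\nprint(\"tiled loop nest matches numpy.dot\")" ] },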
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bn = 32\nkfactor = 4\ns = te.create_schedule(C.op)\n\n# Blocking by loop tiling\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n(kaxis,) = s[C].op.reduce_axis\nko, ki = s[C].split(kaxis, factor=kfactor)\n\n# Hoist reduction domain outside the blocking loop\ns[C].reorder(mo, no, ko, ki, mi, ni)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\n# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,\n# we can see a big speedup compared with the baseline.\nevaluator = func.time_evaluator(func.entry_name, dev, number=10)\nprint(\"Opt1: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after blocking.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectorization\nAnother important trick is vectorization. When the memory access pattern is uniform,\nthe compiler can detect this pattern and pass the contiguous memory to the vector processor. In\nTVM, we can use the `vectorize` interface to hint the compiler about this pattern, so that the\ncomputation can be accelerated greatly.\n\nIn this tutorial, we choose to vectorize the inner loop row data since it is cache friendly.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = te.create_schedule(C.op)\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n(kaxis,) = s[C].op.reduce_axis\nko, ki = s[C].split(kaxis, factor=kfactor)\n\ns[C].reorder(mo, no, ko, ki, mi, ni)\n\n# Vectorization\ns[C].vectorize(ni)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=10)\nprint(\"Opt2: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after vectorization.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loop Permutation\nIf we look at the IR above, we can see that the inner loop row data is vectorized for both B and\nC. Next we will look at the access pattern of A. In the current schedule, A is accessed column by\ncolumn, which is not cache friendly. If we change the nested loop order of ki and the inner axis\nmi, the access pattern for the A matrix becomes more cache friendly.\n\n" ] },
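{ "cell_type": "markdown", "metadata": {}, "source": [ "To see why this reorder helps, here is a small standalone index trace we added (the names and\nsizes are ours): for one block it lists the row-major offsets into A touched by the innermost\nnon-vectorized loops under each order. Since A[m, k] does not depend on ni, only the relative\norder of ki and mi matters.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The row-major offset of A[m, k] is m * K + k; small demo sizes keep the trace readable.\nK_demo, bn_demo, kfactor_demo = 16, 4, 4\n\n# Order (..., ko, ki, mi, ni): mi varies fastest, so consecutive accesses jump by K_demo floats.\nbefore = [mi * K_demo + ki for ki in range(kfactor_demo) for mi in range(bn_demo)]\n# Order (..., ko, mi, ki, ni): ki varies fastest, so accesses walk along a row with unit stride.\nafter = [mi * K_demo + ki for mi in range(bn_demo) for ki in range(kfactor_demo)]\n\nprint(\"ki outside mi:\", before)\nprint(\"mi outside ki:\", after)" ] },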
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = te.create_schedule(C.op)\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n(kaxis,) = s[C].op.reduce_axis\nko, ki = s[C].split(kaxis, factor=kfactor)\n\n# re-ordering\ns[C].reorder(mo, no, ko, mi, ki, ni)\ns[C].vectorize(ni)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=10)\nprint(\"Opt3: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after loop permutation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Array Packing\nAnother important trick is array packing. The trick is to reorder the storage of a\nmulti-dimensional array so that it is accessed sequentially after it is flattened and stored in\none-dimensional memory.\n\nNOTE: The figure that normally accompanies this section (omitted here) is a general illustration\nof how array packing works.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use array packing to address the access pattern of B. Observe that the access pattern of\nB after flattening is not sequential as we iterate over the K dimension. We can reorder B, which\nhas dimensions [K][N], so that it has dimensions [N/bn][K][bn], where bn is the blocking factor\nand also the vector size for B in the inner loop. This reorder splits N into two dimensions,\nbigN (N/bn) and littleN (bn), and the new dimensions [N/bn][K][bn] match the indexing of B from\nthe outer to the inner loops (no, ko, ki, ni), resulting in a sequential access pattern for B\nafter flattening.\n\n" ] },
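{ "cell_type": "markdown", "metadata": {}, "source": [ "Before wiring this into the TVM schedule, here is a small NumPy sketch we added to show the\npacking itself (the variable names are ours): packedB[n // bn, k, n % bn] equals B[k, n], so for\na fixed block index no, the data read in the (ko, ki, ni) loops sit contiguously in\npackedB[no, :, :].\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n# Small demo sizes; the real schedule uses K = N = 1024 and bn = 32.\nK_demo, N_demo, bn_demo = 6, 8, 4\nB_demo = np.arange(K_demo * N_demo, dtype=\"float32\").reshape(K_demo, N_demo)\n\n# [K][N] -> [N/bn][K][bn]: split the N axis and move the block index outermost.\npacked = B_demo.reshape(K_demo, N_demo // bn_demo, bn_demo).transpose(1, 0, 2).copy()\n\nfor k in range(K_demo):\n    for n in range(N_demo):\n        assert packed[n // bn_demo, k, n % bn_demo] == B_demo[k, n]\nprint(\"packed layout matches B[k, n] at every index\")" ] },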
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We have to re-write the algorithm slightly.\npackedB = te.compute(\n    (N // bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name=\"packedB\"\n)\nC = te.compute(\n    (M, N),\n    lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),\n    name=\"C\",\n)\n\ns = te.create_schedule(C.op)\n\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n(kaxis,) = s[C].op.reduce_axis\nko, ki = s[C].split(kaxis, factor=kfactor)\n\ns[C].reorder(mo, no, ko, mi, ki, ni)\ns[C].vectorize(ni)\n\nbigN, _, littleN = s[packedB].op.axis\ns[packedB].vectorize(littleN)\ns[packedB].parallel(bigN)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=10)\nprint(\"Opt4: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after array packing.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write cache for blocks\nAfter blocking, the program writes the result to C block by block, so the access pattern is not\nsequential. We can therefore use a sequential cache array to hold the block results and write\nthem to C once all of a block's results are ready.\n\n" ] },
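{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a small NumPy sketch we added of that idea (names and sizes are ours): each bn x bn block\nof C is accumulated in a small contiguous buffer, the \"write cache\", and stored to C with one\nsequential write per block. The real version below expresses this with `s.cache_write` and\n`compute_at`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nsize, bn_demo = 8, 4\nA_s = np.random.rand(size, size).astype(\"float32\")\nB_s = np.random.rand(size, size).astype(\"float32\")\nC_s = np.zeros((size, size), dtype=\"float32\")\n\nfor mo in range(0, size, bn_demo):\n    for no in range(0, size, bn_demo):\n        block = np.zeros((bn_demo, bn_demo), dtype=\"float32\")  # the sequential \"write cache\"\n        for ko in range(0, size, bn_demo):\n            block += A_s[mo:mo + bn_demo, ko:ko + bn_demo] @ B_s[ko:ko + bn_demo, no:no + bn_demo]\n        C_s[mo:mo + bn_demo, no:no + bn_demo] = block  # one sequential store per block\n\nnp.testing.assert_allclose(C_s, A_s @ B_s, rtol=1e-5)\nprint(\"write-cache sketch matches numpy.dot\")" ] },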
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = te.create_schedule(C.op)\n\n# Allocate write cache\nCC = s.cache_write(C, \"global\")\n\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n\n# Write cache is computed at no\ns[CC].compute_at(s[C], no)\n\n# New inner axes\nmc, nc = s[CC].op.axis\n\n(kaxis,) = s[CC].op.reduce_axis\nko, ki = s[CC].split(kaxis, factor=kfactor)\ns[CC].reorder(ko, mc, ki, nc)\ns[CC].vectorize(nc)\n\n# TODO: Add separate optimization step to discuss loop unrolling\n# Unrolling is a loop optimization strategy that can reduce branch\n# prediction failures and increase the chance of concurrent execution.\n# Unroll the kfactor loop.\ns[CC].unroll(ki)\n\nbigN, _, littleN = s[packedB].op.axis\ns[packedB].vectorize(littleN)\ns[packedB].parallel(bigN)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=10)\nprint(\"Opt5: %f\" % evaluator(a, b, c).mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after adding the write cache.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallel\nFurthermore, we can also utilize multi-core processors for thread-level parallelization.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = te.create_schedule(C.op)\n\nCC = s.cache_write(C, \"global\")\n\nmo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)\n\ns[CC].compute_at(s[C], no)\n\nmc, nc = s[CC].op.axis\n\n(kaxis,) = s[CC].op.reduce_axis\nko, ki = s[CC].split(kaxis, factor=kfactor)\ns[CC].reorder(ko, mc, ki, nc)\ns[CC].vectorize(nc)\ns[CC].unroll(ki)\n\n# parallel\ns[C].parallel(mo)\n\nbigN, _, littleN = s[packedB].op.axis\ns[packedB].vectorize(littleN)\ns[packedB].parallel(bigN)\n\nfunc = tvm.build(s, [A, B, C], target=target, name=\"mmult\")\nassert func\n\nc = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)\nfunc(a, b, c)\ntvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)\n\nevaluator = func.time_evaluator(func.entry_name, dev, number=50)\nopt6_time = evaluator(a, b, c).mean\nprint(\"Opt6: %f\" % opt6_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the generated IR after parallelization.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(tvm.lower(s, [A, B, C], simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\nAfter applying the above simple optimizations with only 18 lines of code,\nour generated code can achieve 60% of the `numpy` performance with MKL.\nNote that the outputs on the web page reflect the running times on a non-exclusive\nDocker container, so they are *unreliable*. 
We highly encourage you to run the\ntutorial yourself to observe the performance gain achieved by TVM.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }