{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n# Auto-scheduling a Neural Network for NVIDIA GPU\n**Author**: [Lianmin Zheng](https://github.com/merrymercy)\n\nAuto-tuning for specific devices and workloads is critical for getting the\nbest performance. This is a tutorial on how to tune a whole neural\nnetwork for NVIDIA GPU with the auto-scheduler.\n\nTo auto-tune a neural network, we partition the network into small subgraphs and\ntune them independently. Each subgraph is treated as one search task.\nA task scheduler slices the tuning time and dynamically allocates it to\nthese tasks: it predicts the impact of each task on the end-to-end\nexecution time and prioritizes the one that can reduce the execution time the most.\n\nFor each subgraph, we use the compute declaration in :code:`tvm/python/topi` to\nget the computational DAG in tensor expression form.\nWe then use the auto-scheduler to construct a search space for this DAG and search\nfor good schedules (low-level optimizations).\n\nUnlike the template-based `autotvm`, which relies on\nmanual templates to define the search space, the auto-scheduler does not require any\nschedule templates. In other words, the auto-scheduler uses only the compute declarations\nin :code:`tvm/python/topi` and does not rely on existing schedule templates.\n\nNote that this tutorial will not run on Windows or recent versions of macOS. To\nget it to run, you will need to wrap the body of this tutorial in an\n:code:`if __name__ == \"__main__\":` block.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nimport tvm\nfrom tvm import relay, auto_scheduler\nimport tvm.relay.testing\nfrom tvm.contrib import graph_executor" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Define a Network\nFirst, we need to define the network with the Relay frontend API.\nWe can load pre-defined networks from :code:`tvm.relay.testing`.\nWe can also load models from MXNet, ONNX, PyTorch, and TensorFlow\n(see the frontend tutorials).\n\nFor convolutional neural networks, although the auto-scheduler can work correctly\nwith any layout, we found that the best performance is typically achieved with the NHWC layout.\nWe also implemented more optimizations for the NHWC layout in the auto-scheduler,\nso it is recommended to convert your models to NHWC layout to use the auto-scheduler.\nYou can use the `ConvertLayout` pass to do the layout conversion in TVM.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_network(name, batch_size, layout=\"NHWC\", dtype=\"float32\"):\n    \"\"\"Get the symbol definition and random weights of a network\"\"\"\n\n    # auto-scheduler prefers NHWC layout\n    if layout == \"NHWC\":\n        image_shape = (224, 224, 3)\n    elif layout == \"NCHW\":\n        image_shape = (3, 224, 224)\n    else:\n        raise ValueError(\"Invalid layout: \" + layout)\n\n    input_shape = (batch_size,) + image_shape\n    output_shape = (batch_size, 1000)\n\n    if name.startswith(\"resnet-\"):\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer,\n            batch_size=batch_size,\n            layout=layout,\n            dtype=dtype,\n            image_shape=image_shape,\n        )\n    elif name.startswith(\"resnet3d-\"):\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet_3d.get_workload(\n            num_layers=n_layer,\n            batch_size=batch_size,\n            layout=layout,\n            dtype=dtype,\n            image_shape=image_shape,\n        )\n    elif name == \"mobilenet\":\n        mod, params = relay.testing.mobilenet.get_workload(\n            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape\n        )\n    elif name == \"squeezenet_v1.1\":\n        assert layout == \"NCHW\", \"squeezenet_v1.1 only supports NCHW layout\"\n        mod, params = relay.testing.squeezenet.get_workload(\n            version=\"1.1\",\n            batch_size=batch_size,\n            dtype=dtype,\n            image_shape=image_shape,\n        )\n    elif name == \"inception_v3\":\n        input_shape = (batch_size, 3, 299, 299) if layout == \"NCHW\" else (batch_size, 299, 299, 3)\n        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)\n    elif name == \"mxnet\":\n        # an example of loading an MXNet model\n        from mxnet.gluon.model_zoo.vision import get_model\n\n        assert layout == \"NCHW\"\n\n        block = get_model(\"resnet18_v1\", pretrained=True)\n        mod, params = relay.frontend.from_mxnet(block, shape={\"data\": input_shape}, dtype=dtype)\n        net = mod[\"main\"]\n        net = relay.Function(\n            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs\n        )\n        mod = tvm.IRModule.from_expr(net)\n\n    return mod, params, input_shape, output_shape\n\n\n# Define the neural network and compilation target\nnetwork = \"resnet-18\"\nbatch_size = 1\nlayout = \"NHWC\"\ntarget = tvm.target.Target(\"cuda\")\ndtype = \"float32\"\nlog_file = \"%s-%s-B%d-%s.json\" % (network, layout, batch_size, target.kind.name)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Extract Search Tasks\nNext, we extract the search tasks and their weights from a network.\nThe weight of a task is the number of appearances of the task's subgraph\nin the whole network.\nBy using the weights, we can approximate the end-to-end latency of the network\nas :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the\nlatency of a task and :code:`weight[t]` is the weight of the task.\nThe task scheduler optimizes exactly this objective.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Extract tasks from the network\nprint(\"Extract tasks...\")\nmod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)\ntasks, task_weights = auto_scheduler.extract_tasks(mod[\"main\"], params, target)\n\nfor idx, task in enumerate(tasks):\n    print(\"========== Task %d (workload key: %s) ==========\" % (idx, task.workload_key))\n    print(task.compute_dag)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Begin Tuning\nNow, we set some options for tuning and launch the search tasks:\n\n* :code:`measure_ctx` launches a different process for measurement to\n  provide isolation. It can protect the main process from GPU crashes\n  during measurement and avoid other runtime conflicts.\n* :code:`min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warm up the GPU, which is necessary to get accurate measurement results.\n  Typically, we recommend a value >= 300 ms.\n* :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.\n  You can set it to a small number (e.g., 200) for a fast demonstrative run.\n  In practice, we recommend setting it to around :code:`900 * len(tasks)`,\n  which is typically enough for the search to converge.\n  For example, there are 24 tasks in resnet-18, so we can set it to 20000.\n  You can adjust this parameter according to your time budget.\n* In addition, we use :code:`RecordToFile` to dump measurement records into a log file.\n  The measurement records can be used to query the history best, resume the search,\n  and do more analyses later.\n* See :any:`auto_scheduler.TuningOptions` and\n  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_tuning():\n    print(\"Begin tuning...\")\n    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10)\n\n    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)\n    tune_option = auto_scheduler.TuningOptions(\n        num_measure_trials=200,  # change this to 20000 to achieve the best performance\n        runner=measure_ctx.runner,\n        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n    )\n\n    tuner.tune(tune_option)\n\n\n# We do not run the tuning on our webpage server since it takes too long.\n# Uncomment the following line to run it yourself.\n\n# run_tuning()" ] },
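{ "cell_type": "markdown", "metadata": {}, "source": [ "Once a log file exists, you can approximate the end-to-end latency :code:`sum(latency[t] * weight[t])` described above from the best record of each task. The following is a minimal illustrative sketch, not part of the original tutorial; it assumes the log already contains valid records and that :code:`auto_scheduler.load_best_record` is available in your TVM version.\n\n```python\n# Hedged sketch: approximate the end-to-end latency as sum(latency[t] * weight[t])\n# using the best measured record of each task found in log_file.\ndef estimated_total_latency(log_file, tasks, task_weights):\n    total = 0.0\n    for task, weight in zip(tasks, task_weights):\n        inp, res = auto_scheduler.load_best_record(log_file, task.workload_key)\n        if res is None:\n            continue  # no valid record for this task yet\n        latency = sum(c.value for c in res.costs) / len(res.costs)  # in seconds\n        total += latency * weight\n    return total\n```\n\nThis is the same weighted sum that the task scheduler prints as \"Estimated total latency\" in the note below.\n" ] },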

{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note: Explain the printed information during tuning**\n\nDuring the tuning, a lot of information will be printed on the console;\nit is used for debugging purposes. The most important part is the output\nof the task scheduler. The following table is a sample output.\n\n```c\n----------------------------------------------------------------------\n------------------------------  [ Task Scheduler ]\n----------------------------------------------------------------------\n|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |\n-------------------------------------------------\n|    0 |        0.005 |           0.88 |     64 |\n|    1 |        0.010 |          99.10 |     64 |\n|    2 |        0.006 |           0.00 |     64 |\n|    3 |        0.145 |         979.78 |    384 |\n|    4 |        0.130 |        1097.02 |    384 |\n|    5 |        0.143 |         992.69 |    384 |\n|    6 |        0.076 |        1526.86 |    192 |\n|    7 |        0.115 |         999.44 |    320 |\n|    8 |        0.079 |        1449.39 |    320 |\n|    9 |        0.122 |         938.73 |    384 |\n|   10 |        0.063 |        1832.98 |    192 |\n|   11 |        0.072 |        1763.62 |    256 |\n|   12 |        0.062 |        2036.40 |    192 |\n|   13 |        0.068 |        1874.44 |    192 |\n|   14 |        0.049 |        2346.50 |    128 |\n|   15 |        0.076 |        1694.31 |    256 |\n|   16 |        0.067 |        1933.30 |    448 |\n|   17 |        0.076 |        1680.90 |    256 |\n|   18 |        0.022 |          98.43 |     64 |\n|   19 |        0.076 |        3112.55 |    192 |\n|   20 |        0.013 |        2026.44 |     64 |\n|   21 |        0.011 |        1136.69 |     64 |\n|   22 |        0.013 |         992.47 |     64 |\n|   23 |        0.020 |         627.56 |     64 |\n-------------------------------------------------\nEstimated total latency: 1.587 ms  Trials: 4992  Used time : 13296 s  Next ID: 3\n```\n\nThis table lists the latency and (estimated) speed of all tasks,\nas well as the allocation of measurement trials for each task.\nThe last line prints the total weighted latency of these tasks,\nwhich can serve as a rough estimate of the end-to-end execution time\nof the network. The last line also prints the total number of\nmeasurement trials, the total time spent on auto-tuning, and the ID\nof the next task to tune.\n\nThere will also be some \"tvm::Error\"s and CUDA errors, because the\nauto-scheduler tries some invalid schedules. You can safely ignore them\nas long as the tuning continues, because these errors are isolated from\nthe main process.\n" ] },
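{ "cell_type": "markdown", "metadata": {}, "source": [ "If you have several GPUs registered to an RPC Tracker (see tip 4 in the Other Tips section below), you can parallelize measurements by replacing the local runner with :any:`auto_scheduler.RPCRunner`. The following is a minimal sketch, not part of the original tutorial; the device key \"v100\" and the tracker address 127.0.0.1:9190 are placeholders for your own setup.\n\n```python\n# Hedged sketch: measure candidates on remote GPUs registered to an RPC Tracker\n# instead of measuring locally. The key/host/port values below are placeholders.\ntune_option = auto_scheduler.TuningOptions(\n    num_measure_trials=200,\n    runner=auto_scheduler.RPCRunner(\n        \"v100\",  # device key registered with the tracker (placeholder)\n        host=\"127.0.0.1\",  # tracker host (placeholder)\n        port=9190,  # tracker port (placeholder)\n        repeat=1,\n        min_repeat_ms=300,\n        timeout=10,\n    ),\n    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n)\n```\n" ] },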

{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note: Terminate the tuning earlier**\n\nYou can terminate the tuning earlier by forcibly killing this process.\nAs long as the log file contains at least one valid schedule for each task,\nyou will be able to do the compilation (see the section below); a quick way\nto check this is sketched right after this note.\n" ] },
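{ "cell_type": "markdown", "metadata": {}, "source": [ "To check whether every task already has a valid schedule, you can count the valid records per task in the log file. This is a minimal sketch, not part of the original tutorial; it uses :code:`auto_scheduler.load_records`, where :code:`error_no == 0` marks a successful measurement.\n\n```python\n# Hedged sketch: count the valid measurement records per task in log_file.\nfrom collections import Counter\n\nvalid_counts = Counter()\nfor inp, res in auto_scheduler.load_records(log_file):\n    if res.error_no == 0:  # 0 means the measurement succeeded\n        valid_counts[inp.task.workload_key] += 1\n\nfor idx, task in enumerate(tasks):\n    print(\"task %d: %d valid record(s)\" % (idx, valid_counts[task.workload_key]))\n```\n" ] },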

{ "cell_type": "markdown", "metadata": {}, "source": [ "## Compile and Evaluate\nAfter auto-tuning, we can compile the network with the best schedules we found.\nAll measurement records are dumped into the log file during auto-tuning,\nso we can read the log file and load the best schedules.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Compile with the history best\nprint(\"Compile...\")\nwith auto_scheduler.ApplyHistoryBest(log_file):\n    with tvm.transform.PassContext(opt_level=3, config={\"relay.backend.use_auto_scheduler\": True}):\n        lib = relay.build(mod, target=target, params=params)\n\n# Create graph executor\ndev = tvm.device(str(target), 0)\nmodule = graph_executor.GraphModule(lib[\"default\"](dev))\ndata_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))\nmodule.set_input(\"data\", data_tvm)\n\n# Evaluate\nprint(\"Evaluate inference time cost...\")\nprint(module.benchmark(dev, repeat=3, min_repeat_ms=500))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Other Tips\n1. During the tuning, the auto-scheduler needs to compile many programs and\n   extract features from them. This part is CPU-intensive,\n   so a high-performance CPU with many cores is recommended for a faster search.\n2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json`\n   to distill the large log file and keep only the best, useful records.\n3. You can resume a search from a previous log file. You just need to\n   add the argument :code:`load_log_file` when creating the task scheduler\n   in the function :code:`run_tuning`. For example,\n   :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`\n4. If you have multiple target GPUs, you can use all of them to\n   parallelize the measurements. See the TVM documentation on the RPC Tracker\n   to learn how to use the RPC Tracker and RPC Server.\n   To use the RPC Tracker in the auto-scheduler, replace the runner in :code:`TuningOptions`\n   with :any:`auto_scheduler.RPCRunner`, as sketched earlier in this tutorial.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }