.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "how_to/tune_with_autotvm/tune_relay_x86.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_x86.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py:


.. _tune_relay_x86:

Auto-tuning a Convolutional Network for x86 CPU
===============================================
**Author**: `Yao Wang <https://github.com/kevinthesun>`_, `Eddie Yan <https://github.com/eqy>`_

This is a tutorial about how to tune convolution neural network
for x86 CPU.

Note that this tutorial will not run on Windows or recent versions of macOS. To
get it to run, you will need to wrap the body of this tutorial in a :code:`if
__name__ == "__main__":` block.

.. GENERATED FROM PYTHON SOURCE LINES 31-41

.. code-block:: default

    import os
    import numpy as np

    import tvm
    from tvm import relay, autotvm
    from tvm.relay import testing
    from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
    from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner
    import tvm.contrib.graph_executor as runtime


.. GENERATED FROM PYTHON SOURCE LINES 42-50

Define network
--------------
First we need to define the network in relay frontend API.
We can either load some pre-defined network from :code:`relay.testing`
or building :any:`relay.testing.resnet` with relay.
We can also load models from MXNet, ONNX and TensorFlow.

In this tutorial, we choose resnet-18 as tuning example.

.. GENERATED FROM PYTHON SOURCE LINES 50-116

.. code-block:: default


    def get_network(name, batch_size):
        """Get the symbol definition and random weight of a network"""
        input_shape = (batch_size, 3, 224, 224)
        output_shape = (batch_size, 1000)

        if "resnet" in name:
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer, batch_size=batch_size, dtype=dtype
            )
        elif "vgg" in name:
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.vgg.get_workload(
                num_layers=n_layer, batch_size=batch_size, dtype=dtype
            )
        elif name == "mobilenet":
            mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "squeezenet_v1.1":
            mod, params = relay.testing.squeezenet.get_workload(
                batch_size=batch_size, version="1.1", dtype=dtype
            )
        elif name == "inception_v3":
            input_shape = (batch_size, 3, 299, 299)
            mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "mxnet":
            # an example for mxnet model
            from mxnet.gluon.model_zoo.vision import get_model

            block = get_model("resnet18_v1", pretrained=True)
            mod, params = relay.frontend.from_mxnet(block, shape={input_name: input_shape}, dtype=dtype)
            net = mod["main"]
            net = relay.Function(
                net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
            )
            mod = tvm.IRModule.from_expr(net)
        else:
            raise ValueError("Unsupported network: " + name)

        return mod, params, input_shape, output_shape


    # Replace "llvm" with the correct target of your CPU.
    # For example, for AWS EC2 c5 instance with Intel Xeon
    # Platinum 8000 series, the target should be "llvm -mcpu=skylake-avx512".
    # For AWS EC2 c4 instance with Intel Xeon E5-2666 v3, it should be
    # "llvm -mcpu=core-avx2".
    target = "llvm"

    batch_size = 1
    dtype = "float32"
    model_name = "resnet-18"
    log_file = "%s.log" % model_name
    graph_opt_sch_file = "%s_graph_opt.log" % model_name

    # Set the input name of the graph
    # For ONNX models, it is typically "0".
    input_name = "data"

    # Set number of threads used for tuning based on the number of
    # physical CPU cores on your machine.
    num_threads = 1
    os.environ["TVM_NUM_THREADS"] = str(num_threads)


.. GENERATED FROM PYTHON SOURCE LINES 117-133

Configure tensor tuning settings and create tasks
-------------------------------------------------
To get better kernel execution performance on x86 CPU,
we need to change data layout of convolution kernel from
"NCHW" to "NCHWc". To deal with this situation, we define
conv2d_NCHWc operator in topi. We will tune this operator
instead of plain conv2d.

We will use local mode for tuning configuration. RPC tracker
mode can be setup similarly to the approach in
:ref:`tune_relay_arm` tutorial.

To perform a precise measurement, we should repeat the measurement several
times and use the average of results. In addition, we need to flush the cache
for the weight tensors between repeated measurements. This can make the measured
latency of one operator closer to its actual latency during end-to-end inference.

.. GENERATED FROM PYTHON SOURCE LINES 133-193

.. code-block:: default


    tuning_option = {
        "log_filename": log_file,
        "tuner": "random",
        "early_stopping": None,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(),
            runner=autotvm.LocalRunner(
                number=1, repeat=10, min_repeat_ms=0, enable_cpu_cache_flush=True
            ),
        ),
    }


    # You can skip the implementation of this function for this tutorial.
    def tune_kernels(
        tasks, measure_option, tuner="gridsearch", early_stopping=None, log_filename="tuning.log"
    ):

        for i, task in enumerate(tasks):
            prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

            # create tuner
            if tuner == "xgb" or tuner == "xgb-rank":
                tuner_obj = XGBTuner(task, loss_type="rank")
            elif tuner == "ga":
                tuner_obj = GATuner(task, pop_size=50)
            elif tuner == "random":
                tuner_obj = RandomTuner(task)
            elif tuner == "gridsearch":
                tuner_obj = GridSearchTuner(task)
            else:
                raise ValueError("Invalid tuner: " + tuner)

            # do tuning
            n_trial = len(task.config_space)
            tuner_obj.tune(
                n_trial=n_trial,
                early_stopping=early_stopping,
                measure_option=measure_option,
                callbacks=[
                    autotvm.callback.progress_bar(n_trial, prefix=prefix),
                    autotvm.callback.log_to_file(log_filename),
                ],
            )


    # Use graph tuner to achieve graph level optimal schedules
    # Set use_DP=False if it takes too long to finish.
    def tune_graph(graph, dshape, records, opt_sch_file, use_DP=True):
        target_op = [
            relay.op.get("nn.conv2d"),
        ]
        Tuner = DPTuner if use_DP else PBQPTuner
        executor = Tuner(graph, {input_name: dshape}, records, target_op, target)
        executor.benchmark_layout_transform(min_exec_num=2000)
        executor.run()
        executor.write_opt_sch2record_file(opt_sch_file)


.. GENERATED FROM PYTHON SOURCE LINES 194-195

Finally, we launch tuning jobs and evaluate the end-to-end performance.

.. GENERATED FROM PYTHON SOURCE LINES 195-250

.. code-block:: default


    def evaluate_performance(lib, data_shape):
        # upload parameters to device
        dev = tvm.cpu()
        data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
        module = runtime.GraphModule(lib["default"](dev))
        module.set_input(input_name, data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        print(module.benchmark(dev, number=100, repeat=3))


    def tune_and_evaluate(tuning_opt):
        # extract workloads from relay program
        print("Extract tasks...")
        mod, params, data_shape, out_shape = get_network(model_name, batch_size)
        tasks = autotvm.task.extract_from_program(
            mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
        )

        # run tuning tasks
        tune_kernels(tasks, **tuning_opt)
        tune_graph(mod["main"], data_shape, log_file, graph_opt_sch_file)

        # compile kernels in default mode
        print("Evaluation of the network compiled in 'default' mode without auto tune:")
        with tvm.transform.PassContext(opt_level=3):
            print("Compile...")
            lib = relay.build(mod, target=target, params=params)
            evaluate_performance(lib, data_shape)

        # compile kernels in kernel tuned only mode
        print("\nEvaluation of the network been tuned on kernel level:")
        with autotvm.apply_history_best(log_file):
            print("Compile...")
            with tvm.transform.PassContext(opt_level=3):
                lib = relay.build(mod, target=target, params=params)
            evaluate_performance(lib, data_shape)

        # compile kernels with graph-level best records
        print("\nEvaluation of the network been tuned on graph level:")
        with autotvm.apply_graph_best(graph_opt_sch_file):
            print("Compile...")
            with tvm.transform.PassContext(opt_level=3):
                lib = relay.build_module.build(mod, target=target, params=params)
            evaluate_performance(lib, data_shape)


    # We do not run the tuning in our webpage server since it takes too long.
    # Uncomment the following line to run it by yourself.

    # tune_and_evaluate(tuning_option)


.. GENERATED FROM PYTHON SOURCE LINES 251-299

Sample Output
-------------
The tuning needs to compile many programs and extract feature from them.
So a high performance CPU is recommended.
One sample output is listed below.

.. code-block:: bash

   Extract tasks...
   Tuning...
   [Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
   [Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
   [Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
   [Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
   [Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
   [Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
   [Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
   [Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
   [Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
   [Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
   [Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
   [Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
   INFO Start to benchmark layout transformation...
   INFO Benchmarking layout transformation successful.
   INFO Start to run dynamic programming algorithm...
   INFO Start forward pass...
   INFO Finished forward pass.
   INFO Start backward pass...
   INFO Finished backward pass...
   INFO Finished DPExecutor run.
   INFO Writing optimal schedules to resnet-18_graph_opt.log successfully.

   Evaluation of the network compiled in 'default' mode without auto tune:
   Compile...
   Evaluate inference time cost...
   Mean inference time (std dev): 4.5 ms (0.03 ms)

   Evaluation of the network been tuned on kernel level:
   Compile...
   Evaluate inference time cost...
   Mean inference time (std dev): 3.2 ms (0.03 ms)

   Evaluation of the network been tuned on graph level:
   Compile...
   Config for target=llvm -keys=cpu -link-params=0, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
   Config for target=llvm -keys=cpu -link-params=0, workload=('dense_pack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
   Evaluate inference time cost...
   Mean inference time (std dev): 3.16 ms (0.03 ms)