{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)\n**Author**: [Siju Samuel](https://github.com/siju-samuel)\n\nWelcome to part 3 of the Deploy Framework-Prequantized Model with TVM tutorial.\nIn this part, we will start with a Quantized TFLite graph and then compile and execute it via TVM.\n\n\nFor more details on quantizing the model using TFLite, readers are encouraged to\ngo through [Converting Quantized Models](https://www.tensorflow.org/lite/convert/quantization).\n\nThe TFLite models can be downloaded from this [link](https://www.tensorflow.org/lite/guide/hosted_models).\n\nTo get started, Tensorflow and TFLite package needs to be installed as prerequisite.\n\n```bash\n# install tensorflow and tflite\npip install tensorflow==2.1.0\npip install tflite==2.1.0\n```\nNow please check if TFLite package is installed successfully, ``python -c \"import tflite\"``\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        ""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Necessary imports\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import os\n\nimport numpy as np\nimport tflite\n\nimport tvm\nfrom tvm import relay"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Download pretrained Quantized TFLite model\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Download mobilenet V2 TFLite model provided by Google\nfrom tvm.contrib.download import download_testdata\n\nmodel_url = (\n    \"https://storage.googleapis.com/download.tensorflow.org/models/\"\n    \"tflite_11_05_08/mobilenet_v2_1.0_224_quant.tgz\"\n)\n\n# Download model tar file and extract it to get mobilenet_v2_1.0_224.tflite\nmodel_path = download_testdata(\n    model_url, \"mobilenet_v2_1.0_224_quant.tgz\", module=[\"tf\", \"official\"]\n)\nmodel_dir = os.path.dirname(model_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Utils for downloading and extracting zip files\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def extract(path):\n    import tarfile\n\n    if path.endswith(\"tgz\") or path.endswith(\"gz\"):\n        dir_path = os.path.dirname(path)\n        tar = tarfile.open(path)\n        tar.extractall(path=dir_path)\n        tar.close()\n    else:\n        raise RuntimeError(\"Could not decompress the file: \" + path)\n\n\nextract(model_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load a test image\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Get a real image for e2e testing\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def get_real_image(im_height, im_width):\n    from PIL import Image\n\n    repo_base = \"https://github.com/dmlc/web-data/raw/main/tensorflow/models/InceptionV1/\"\n    img_name = \"elephant-299.jpg\"\n    image_url = os.path.join(repo_base, img_name)\n    img_path = download_testdata(image_url, img_name, module=\"data\")\n    image = Image.open(img_path).resize((im_height, im_width))\n    x = np.array(image).astype(\"uint8\")\n    data = np.reshape(x, (1, im_height, im_width, 3))\n    return data\n\n\ndata = get_real_image(224, 224)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load a tflite model\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we can open mobilenet_v2_1.0_224.tflite\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tflite_model_file = os.path.join(model_dir, \"mobilenet_v2_1.0_224_quant.tflite\")\ntflite_model_buf = open(tflite_model_file, \"rb\").read()\n\n# Get TFLite model from buffer\ntry:\n    import tflite\n\n    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)\nexcept AttributeError:\n    import tflite.Model\n\n    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lets run TFLite pre-quantized model inference and get the TFLite prediction.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def run_tflite_model(tflite_model_buf, input_data):\n    \"\"\"Generic function to execute TFLite\"\"\"\n    try:\n        from tensorflow import lite as interpreter_wrapper\n    except ImportError:\n        from tensorflow.contrib import lite as interpreter_wrapper\n\n    input_data = input_data if isinstance(input_data, list) else [input_data]\n\n    interpreter = interpreter_wrapper.Interpreter(model_content=tflite_model_buf)\n    interpreter.allocate_tensors()\n\n    input_details = interpreter.get_input_details()\n    output_details = interpreter.get_output_details()\n\n    # set input\n    assert len(input_data) == len(input_details)\n    for i in range(len(input_details)):\n        interpreter.set_tensor(input_details[i][\"index\"], input_data[i])\n\n    # Run\n    interpreter.invoke()\n\n    # get output\n    tflite_output = list()\n    for i in range(len(output_details)):\n        tflite_output.append(interpreter.get_tensor(output_details[i][\"index\"]))\n\n    return tflite_output"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lets run TVM compiled pre-quantized model inference and get the TVM prediction.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def run_tvm(lib):\n    from tvm.contrib import graph_executor\n\n    rt_mod = graph_executor.GraphModule(lib[\"default\"](tvm.cpu(0)))\n    rt_mod.set_input(\"input\", data)\n    rt_mod.run()\n    tvm_res = rt_mod.get_output(0).numpy()\n    tvm_pred = np.squeeze(tvm_res).argsort()[-5:][::-1]\n    return tvm_pred, rt_mod"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TFLite inference\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run TFLite inference on the quantized model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tflite_res = run_tflite_model(tflite_model_buf, data)\ntflite_pred = np.squeeze(tflite_res).argsort()[-5:][::-1]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TVM compilation and inference\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We use the TFLite-Relay parser to convert the TFLite pre-quantized graph into Relay IR. Note that\nfrontend parser call for a pre-quantized model is exactly same as frontend parser call for a FP32\nmodel. We encourage you to remove the comment from print(mod) and inspect the Relay module. You\nwill see many QNN operators, like, Requantize, Quantize and QNN Conv2D.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "dtype_dict = {\"input\": data.dtype.name}\nshape_dict = {\"input\": data.shape}\n\nmod, params = relay.frontend.from_tflite(tflite_model, shape_dict=shape_dict, dtype_dict=dtype_dict)\n# print(mod)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lets now the compile the Relay module. We use the \"llvm\" target here. Please replace it with the\ntarget platform that you are interested in.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "target = \"llvm\"\nwith tvm.transform.PassContext(opt_level=3):\n    lib = relay.build_module.build(mod, target=target, params=params)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, lets call inference on the TVM compiled module.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tvm_pred, rt_mod = run_tvm(lib)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Accuracy comparison\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Print the top-5 labels for MXNet and TVM inference.\nChecking the labels because the requantize implementation is different between\nTFLite and Relay. This cause final output numbers to mismatch. So, testing accuracy via labels.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"TVM Top-5 labels:\", tvm_pred)\nprint(\"TFLite Top-5 labels:\", tflite_pred)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Measure performance\nHere we give an example of how to measure performance of TVM compiled models.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_repeat = 100  # should be bigger to make the measurement more accurate\ndev = tvm.cpu(0)\nprint(rt_mod.benchmark(dev, number=1, repeat=n_repeat))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>Unless the hardware has special support for fast 8 bit instructions, quantized models are\n  not expected to be any faster than FP32 models. Without fast 8 bit instructions, TVM does\n  quantized convolution in 16 bit, even if the model itself is 8 bit.\n\n  For x86, the best performance can be achieved on CPUs with AVX512 instructions set.\n  In this case, TVM utilizes the fastest available 8 bit instructions for the given target.\n  This includes support for the VNNI 8 bit dot product instruction (CascadeLake or newer).\n  For EC2 C5.12x large instance, TVM latency for this tutorial is ~2 ms.\n\n  Intel conv2d NCHWc schedule on ARM gives better end-to-end latency compared to ARM NCHW\n  conv2d spatial pack schedule for many TFLite networks. ARM winograd performance is higher but\n  it has a high memory footprint.\n\n  Moreover, the following general tips for CPU performance equally applies:\n\n   * Set the environment variable TVM_NUM_THREADS to the number of physical cores\n   * Choose the best target for your hardware, such as \"llvm -mcpu=skylake-avx512\" or\n     \"llvm -mcpu=cascadelake\" (more CPUs with AVX512 would come in the future)\n   * Perform autotuning - `Auto-tuning a convolution network for x86 CPU\n     <tune_relay_x86>`.\n   * To get best inference performance on ARM CPU, change target argument\n     according to your device and follow `Auto-tuning a convolution\n     network for ARM CPU <tune_relay_arm>`.</p></div>\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}