{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)\n**Author**: [Siju Samuel](https://github.com/siju-samuel)\n\nWelcome to part 3 of the Deploy Framework-Prequantized Model with TVM tutorial.\nIn this part, we will start with a Quantized TFLite graph and then compile and execute it via TVM.\n\n\nFor more details on quantizing the model using TFLite, readers are encouraged to\ngo through [Converting Quantized Models](https://www.tensorflow.org/lite/convert/quantization).\n\nThe TFLite models can be downloaded from this [link](https://www.tensorflow.org/lite/guide/hosted_models).\n\nTo get started, Tensorflow and TFLite package needs to be installed as prerequisite.\n\n```bash\n# install tensorflow and tflite\npip install tensorflow==2.1.0\npip install tflite==2.1.0\n```\nNow please check if TFLite package is installed successfully, ``python -c \"import tflite\"``\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Necessary imports\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n\nimport numpy as np\nimport tflite\n\nimport tvm\nfrom tvm import relay" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download pretrained Quantized TFLite model\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Download mobilenet V2 TFLite model provided by Google\nfrom tvm.contrib.download import download_testdata\n\nmodel_url = (\n \"https://storage.googleapis.com/download.tensorflow.org/models/\"\n \"tflite_11_05_08/mobilenet_v2_1.0_224_quant.tgz\"\n)\n\n# Download model tar file and extract it to get mobilenet_v2_1.0_224.tflite\nmodel_path = download_testdata(\n model_url, \"mobilenet_v2_1.0_224_quant.tgz\", module=[\"tf\", \"official\"]\n)\nmodel_dir = os.path.dirname(model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Utils for downloading and extracting zip files\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def extract(path):\n import tarfile\n\n if path.endswith(\"tgz\") or path.endswith(\"gz\"):\n dir_path = os.path.dirname(path)\n tar = tarfile.open(path)\n tar.extractall(path=dir_path)\n tar.close()\n else:\n raise RuntimeError(\"Could not decompress the file: \" + path)\n\n\nextract(model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load a test image\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a real image for e2e testing\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_real_image(im_height, im_width):\n from PIL import Image\n\n repo_base = \"https://github.com/dmlc/web-data/raw/main/tensorflow/models/InceptionV1/\"\n img_name = \"elephant-299.jpg\"\n image_url = os.path.join(repo_base, img_name)\n img_path = download_testdata(image_url, img_name, module=\"data\")\n image = Image.open(img_path).resize((im_height, im_width))\n x = np.array(image).astype(\"uint8\")\n data = np.reshape(x, (1, im_height, im_width, 3))\n return data\n\n\ndata = 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Load a tflite model\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can open mobilenet_v2_1.0_224_quant.tflite.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tflite_model_file = os.path.join(model_dir, \"mobilenet_v2_1.0_224_quant.tflite\")\ntflite_model_buf = open(tflite_model_file, \"rb\").read()\n\n# Get TFLite model from buffer\ntry:\n    import tflite\n\n    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)\nexcept AttributeError:\n    import tflite.Model\n\n    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a helper to run the TFLite pre-quantized model inference and get the TFLite prediction.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_tflite_model(tflite_model_buf, input_data):\n    \"\"\"Generic function to execute TFLite\"\"\"\n    try:\n        from tensorflow import lite as interpreter_wrapper\n    except ImportError:\n        from tensorflow.contrib import lite as interpreter_wrapper\n\n    input_data = input_data if isinstance(input_data, list) else [input_data]\n\n    interpreter = interpreter_wrapper.Interpreter(model_content=tflite_model_buf)\n    interpreter.allocate_tensors()\n\n    input_details = interpreter.get_input_details()\n    output_details = interpreter.get_output_details()\n\n    # set input\n    assert len(input_data) == len(input_details)\n    for i in range(len(input_details)):\n        interpreter.set_tensor(input_details[i][\"index\"], input_data[i])\n\n    # Run\n    interpreter.invoke()\n\n    # get output\n    tflite_output = list()\n    for i in range(len(output_details)):\n        tflite_output.append(interpreter.get_tensor(output_details[i][\"index\"]))\n\n    return tflite_output" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's also define a helper to run the TVM-compiled pre-quantized model inference and get the TVM prediction.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_tvm(lib):\n    from tvm.contrib import graph_executor\n\n    rt_mod = graph_executor.GraphModule(lib[\"default\"](tvm.cpu(0)))\n    rt_mod.set_input(\"input\", data)\n    rt_mod.run()\n    tvm_res = rt_mod.get_output(0).numpy()\n    tvm_pred = np.squeeze(tvm_res).argsort()[-5:][::-1]\n    return tvm_pred, rt_mod" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## TFLite inference\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Run TFLite inference on the quantized model.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tflite_res = run_tflite_model(tflite_model_buf, data)\ntflite_pred = np.squeeze(tflite_res).argsort()[-5:][::-1]" ] },
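{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on to TVM compilation, it can be instructive to peek at the quantization parameters the pre-quantized TFLite model carries for its input and output tensors. The sketch below is optional and assumes the tensorflow package installed earlier; each tensor detail exposes a (scale, zero_point) pair under the \"quantization\" key.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optional: inspect the (scale, zero_point) quantization parameters stored in\n# the pre-quantized TFLite model for its input and output tensors.\nfrom tensorflow import lite as tfl\n\ninspector = tfl.Interpreter(model_content=tflite_model_buf)\ninspector.allocate_tensors()\n\nfor detail in inspector.get_input_details() + inspector.get_output_details():\n    print(detail[\"name\"], detail[\"dtype\"], \"quantization:\", detail[\"quantization\"])" ] },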
{ "cell_type": "markdown", "metadata": {}, "source": [ "## TVM compilation and inference\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We use the TFLite-Relay parser to convert the TFLite pre-quantized graph into Relay IR. Note that the\nfrontend parser call for a pre-quantized model is exactly the same as the frontend parser call for an FP32\nmodel. We encourage you to uncomment print(mod) and inspect the Relay module. You\nwill see many QNN operators, like Requantize, Quantize and QNN Conv2D.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dtype_dict = {\"input\": data.dtype.name}\nshape_dict = {\"input\": data.shape}\n\nmod, params = relay.frontend.from_tflite(tflite_model, shape_dict=shape_dict, dtype_dict=dtype_dict)\n# print(mod)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's now compile the Relay module. We use the \"llvm\" target here. Please replace it with the\ntarget platform that you are interested in.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target = \"llvm\"\nwith tvm.transform.PassContext(opt_level=3):\n    lib = relay.build_module.build(mod, target=target, params=params)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's call inference on the TVM-compiled module.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tvm_pred, rt_mod = run_tvm(lib)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Accuracy comparison\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Print the top-5 labels for TFLite and TVM inference.\nWe check the labels because the requantize implementation differs between\nTFLite and Relay, which causes the final output numbers to mismatch. So, we test accuracy via labels.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"TVM Top-5 labels:\", tvm_pred)\nprint(\"TFLite Top-5 labels:\", tflite_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Measure performance\nHere we give an example of how to measure the performance of the TVM-compiled model.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_repeat = 100  # should be bigger to make the measurement more accurate\ndev = tvm.cpu(0)\nprint(rt_mod.benchmark(dev, number=1, repeat=n_repeat))" ] },
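{ "cell_type": "markdown", "metadata": {}, "source": [ "For a rough point of comparison, the sketch below times the TFLite interpreter on the same input with Python's timeit. It assumes the tensorflow package installed earlier and only gives a ballpark number, not a rigorous apples-to-apples benchmark against rt_mod.benchmark.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Rough, optional comparison: time the TFLite interpreter on the same input.\nimport timeit\n\nfrom tensorflow import lite as tfl\n\ninterpreter = tfl.Interpreter(model_content=tflite_model_buf)\ninterpreter.allocate_tensors()\ninterpreter.set_tensor(interpreter.get_input_details()[0][\"index\"], data)\ninterpreter.invoke()  # warm up\n\n# Each entry in `times` is the wall-clock time of a single invoke() call.\ntimes = timeit.repeat(interpreter.invoke, number=1, repeat=n_repeat)\nprint(\"TFLite mean inference time (ms): %.2f\" % (1000 * np.mean(times)))" ] },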
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unless the hardware has special support for fast 8 bit instructions, quantized models are\nnot expected to be any faster than FP32 models. Without fast 8 bit instructions, TVM does\nquantized convolution in 16 bit, even if the model itself is 8 bit.\n\nFor x86, the best performance can be achieved on CPUs with the AVX512 instruction set.\nIn this case, TVM utilizes the fastest available 8 bit instructions for the given target.\nThis includes support for the VNNI 8 bit dot product instruction (Cascade Lake or newer).\nFor an EC2 C5.12xlarge instance, TVM latency for this tutorial is ~2 ms.\n\nThe Intel conv2d NCHWc schedule on ARM gives better end-to-end latency compared to the ARM NCHW\nconv2d spatial pack schedule for many TFLite networks. ARM Winograd performance is higher, but\nit has a high memory footprint.\n\nMoreover, the following general tips for CPU performance apply equally:\n\n* Set the environment variable TVM_NUM_THREADS to the number of physical cores\n* Choose the best target for your hardware, such as \"llvm -mcpu=skylake-avx512\" or\n  \"llvm -mcpu=cascadelake\" (more CPUs with AVX512 would come in the future)\n* Perform autotuning - `Auto-tuning a convolution network for x86 CPU\n