{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule\n**Author**: [Chengfan Jia](https://github.com/jcf94/)\n\nThis is a tutorial on how to use the auto-scheduler to tune a sparse matrix multiplication for\nCPUs.\n\nAuto-scheduler is designed to explore the schedule with best performance for a given computation\ndeclaration automatically. While sometimes, we may have a demand to try some special ops which may\nnot been well-supported by auto-scheduler's default sketch rules and result in poor performance.\nFortunately, auto-scheduler currently allows user to provide a CustomSketch to cover these cases.\n\nWe use sparse matrix multiplication as an example in this tutorial to demonstrate how to implement\nand plug a custom sketch rule to the auto-scheduler's search policy.\n\nNote that this tutorial will not run on Windows or recent versions of macOS. To\nget it to run, you will need to wrap the body of this tutorial in a :code:`if\n__name__ == \"__main__\":` block.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n\nimport numpy as np\nimport tvm\nimport tvm.testing\nfrom tvm import te, auto_scheduler, runtime, topi\nfrom tvm.auto_scheduler import _ffi_api\nfrom tvm.topi.utils import get_const_tuple\nfrom tvm.topi.sparse.utils import random_bsr_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the computation\nTo begin with, let us define the computation of a sparse matmul with several relu and bias add.\nThe function should return the list of input/output tensors.\nFrom these tensors, the auto-scheduler can get the whole computational graph.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "@auto_scheduler.register_workload\ndef sparse_dense(M, N, K, w_data_shape, w_indices_shape, w_indptr_shape, dtype):\n X = te.placeholder(shape=(M, K), dtype=dtype)\n W_data = te.placeholder(shape=w_data_shape, dtype=dtype)\n W_indices = te.placeholder(shape=w_indices_shape, dtype=\"int32\")\n W_indptr = te.placeholder(shape=w_indptr_shape, dtype=\"int32\")\n B = te.placeholder(shape=(M, N), dtype=dtype)\n\n out = topi.nn.sparse_dense(topi.nn.relu(X), W_data, W_indices, W_indptr)\n out = te.compute((M, N), lambda i, j: out[i, j] + B[i, j], name=\"BiasAdd\")\n out = topi.nn.relu(out)\n\n return [X, W_data, W_indices, W_indptr, B, out]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Special step for sparse workload\nDuring schedule tuning, auto-scheduler will use random inputs to measure the performance of a\ngenerated schedule. 
However, we cannot directly use a random array as the input of a sparse op, because the\n\"indices\" and \"indptr\" arrays are meaningful for the computation.\n\nTo solve this problem, we register them as special buffers and load them when measuring the\nprogram.\nSee `tvm.auto_scheduler.measure.py` for more details.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define the basic shapes of this sparse computation\nM = 128\nK = 256\nN = 512\nBS_R = 16\nBS_C = 1\ndensity = 0.6\n\n# Generate the test data with numpy\nX_np = np.random.randn(M, K).astype(\"float32\")\nX_np = np.maximum(np.zeros((M, K), dtype=\"float32\"), X_np)  # Relu\nW_sp_np = random_bsr_matrix(N, K, BS_R, BS_C, density=density, dtype=\"float32\")\nW_np = W_sp_np.todense()\nY_np = X_np @ W_np.T  # Matrix multiplication\nB_np = np.random.randn(M, N).astype(\"float32\")\nY_np = Y_np + B_np  # Bias add\nY_np = np.maximum(np.zeros((M, N), dtype=\"float32\"), Y_np)  # Relu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the search task\nWe then create a search task with the shapes defined above (M=128, N=512, K=256) and\ndtype=\"float32\".\nIf your machine supports AVX instructions, you can:\n\n - replace \"llvm\" below with \"llvm -mcpu=core-avx2\" to enable AVX2\n - replace \"llvm\" below with \"llvm -mcpu=skylake-avx512\" to enable AVX-512\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target = tvm.target.Target(\"llvm\")\n\n# Register the sparse data to task inputs\nprefix = \"sparse_dense_bsr_%d_%d_%d_%d_%d_%d_\" % (\n    N,\n    K,\n    BS_R,\n    BS_C,\n    W_sp_np.indices.shape[0],\n    W_sp_np.indptr.shape[0],\n)\ntask = tvm.auto_scheduler.SearchTask(\n    func=sparse_dense,\n    args=(M, N, K, W_sp_np.data.shape, W_sp_np.indices.shape, W_sp_np.indptr.shape, \"float32\"),\n    target=target,\n    task_inputs={\n        prefix + \"W_data\": runtime.ndarray.array(W_sp_np.data),\n        prefix + \"W_indices\": runtime.ndarray.array(W_sp_np.indices),\n        prefix + \"W_indptr\": runtime.ndarray.array(W_sp_np.indptr),\n    },\n    task_inputs_save_to_file=True,\n)\n\n# Inspect the computational graph\nprint(\"Computational DAG:\")\nprint(task.compute_dag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write the custom sketch rule for the sparse dense op\nBefore tuning, we need to define the CustomSketchRule for the sparse dense op.\n\nA CustomSketchRule consists of two parts: the condition function and the apply function.\n\n - condition function: describes when to apply this sketch rule. For example, we can apply the\n   rule only to the sparse ops, by matching their name and tag.\n - apply function: describes how to generate the initial sketch. You can implement it using the\n   loop state APIs provided by the auto-scheduler; a minimal skeleton of the calling convention\n   is sketched below.\n\n
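Both callbacks receive `(search_policy, state, stage_id)`. The condition function returns one of\n`PreloadCustomSketchRule.PASS`, `APPLY`, or `APPLY_AND_SKIP_REST`, and the apply function returns\na list of `[state, next_stage_id]` pairs for the search policy to continue from. A hypothetical\nskeleton (the names are made up and it reuses the `auto_scheduler` import above; the real rule\nfor the sparse dense op is defined in the next cell):\n\n```python\ndef my_condition_func(search_policy, state, stage_id):\n    # Decide per stage whether this rule fires; PASS skips it\n    return auto_scheduler.PreloadCustomSketchRule.PASS\n\n\ndef my_apply_func(search_policy, state, stage_id):\n    # Return [state, next_stage_id] pairs; stages are visited from the\n    # last stage id down to the first during sketch generation\n    return [[state, stage_id - 1]]\n```\n\n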
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def meet_condition_func(search_policy, state, stage_id):\n    state = auto_scheduler.loop_state.State(state, search_policy.search_task.compute_dag)\n    if state.stages[stage_id].op.tag in [\n        \"sparse_dense_sp_rhs_bsrmm\",\n        \"sparse_dense_sp_rhs_bsrmm_block\",\n    ]:\n        return auto_scheduler.PreloadCustomSketchRule.APPLY_AND_SKIP_REST\n    else:\n        return auto_scheduler.PreloadCustomSketchRule.PASS\n\n\ndef apply_func(search_policy, state, stage_id):\n    ret = []\n    s0 = auto_scheduler.loop_state.State(state, search_policy.search_task.compute_dag)\n    if s0.stages[stage_id].op.tag == \"sparse_dense_sp_rhs_bsrmm_block\":\n        return [s0.state_object, stage_id - 1]\n\n    sparse_dense = s0.stages[stage_id].op\n    sparse_dense_block = s0.stages[stage_id - 1].op\n    assert sparse_dense.tag == \"sparse_dense_sp_rhs_bsrmm\"\n    assert sparse_dense_block.tag == \"sparse_dense_sp_rhs_bsrmm_block\"\n\n    # Set the default consumer of the compute block\n    consumer = sparse_dense\n\n    # If sparse_dense has a single elementwise consumer,\n    # we can inline the sparse_dense output stage\n    consumers = _ffi_api.SearchPolicyUtilsGetConsumers(\n        search_policy.search_task, s0.state_object, stage_id\n    )\n    if len(consumers) == 1:\n        consumer_id = int(consumers.items()[0][0])\n        if _ffi_api.SearchPolicyUtilsIsElementwiseMatch(\n            search_policy.search_task, s0.state_object, stage_id, consumer_id\n        ):\n            consumer = s0.stages[consumer_id].op\n            s0.compute_inline(sparse_dense)\n\n    # Multi-level tiling: split the block stage, and use follow_split to keep\n    # the consumer's loop structure in sync before computing the block there\n    i, nb_j, j, row_offset, c = s0[sparse_dense_block].iters\n    m, n = s0[consumer].iters\n    i0, i1, i2 = s0.split(sparse_dense_block, i, [None, None])\n    m0, m1 = s0.follow_split(consumer, m, len(s0.transform_steps) - 1, 1)\n    j0, j1 = s0.split(sparse_dense_block, nb_j, [None])\n    n0, n1 = s0.follow_split(consumer, n, len(s0.transform_steps) - 1, 1)\n    s0.reorder(sparse_dense_block, [i0, j0, i1, j1, row_offset, i2, j, c])\n    s0.reorder(consumer, [m0, n0, m1, n1])\n    s0.compute_at(sparse_dense_block, consumer, n0)\n\n    ret.append([s0.state_object, stage_id - 2])\n\n    return ret" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set parameters for the auto-scheduler with the custom sketch plugged in.\n\n* :code:`num_measure_trials` is the number of measurement trials we can use during the search.\n  We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a good\n  value for the search to converge. You can do more trials according to your time budget.\n* In addition, we use :code:`RecordToFile` to dump measurement records into a file\n  `sparse_dense.json`.\n  The measurement records can be used to query the history best, resume the search (see the\n  sketch after this list), and do more analyses later.\n* See :any:`auto_scheduler.TuningOptions` for more parameters.\n* Here, we need to create an :code:`auto_scheduler.SketchPolicy` object and add the custom\n  sketch rule via `init_search_callbacks`.\n\n
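To resume a search in a later session from the dumped records, you can warm up the cost model\nand preload the measured states. A minimal sketch (using the existing\n`XGBModel.update_from_file` and `PreloadMeasuredStates` APIs; it assumes the same `task`,\n`meet_condition_func`, and `apply_func` as in this tutorial):\n\n```python\n# Resume the search from sparse_dense.json (sketch)\ncost_model = auto_scheduler.XGBModel()\ncost_model.update_from_file(\"sparse_dense.json\")\nresumed_policy = auto_scheduler.SketchPolicy(\n    task,\n    program_cost_model=cost_model,\n    init_search_callbacks=[\n        auto_scheduler.PreloadMeasuredStates(\"sparse_dense.json\"),\n        auto_scheduler.PreloadCustomSketchRule(meet_condition_func, apply_func, \"SparseDense\"),\n    ],\n)\n```\n\n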
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "log_file = \"sparse_dense.json\"\ntune_option = auto_scheduler.TuningOptions(\n    num_measure_trials=10,\n    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n    verbose=2,\n)\n\nsearch_policy = auto_scheduler.SketchPolicy(\n    task,\n    program_cost_model=auto_scheduler.XGBModel(),\n    init_search_callbacks=[\n        auto_scheduler.PreloadCustomSketchRule(meet_condition_func, apply_func, \"SparseDense\")\n    ],\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the search\nNow all inputs are ready.\nWe can kick off the search and let the auto-scheduler do its magic.\nAfter some measurement trials, we can load the best schedule from the log\nfile and apply it.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Run auto-tuning (search)\n# Notice: the 10 measurement trials below take a few minutes to run. Comment out\n# the following line if you only want to apply schedules from an existing log file.\ntask.tune(tune_option, search_policy)\n\n# Apply the best schedule\nsch, args = task.apply_best(log_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can lower the schedule to see the IR after auto-scheduling.\nThe auto-scheduler correctly performs optimizations including multi-level tiling,\nlayout transformation, parallelization, vectorization, unrolling, and operator fusion.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Lowered TIR:\")\nprint(tvm.lower(sch, args, simple_mode=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check correctness and evaluate performance\nWe build the binary and check its correctness and performance.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "func = tvm.build(sch, args, target)\n\ndev = tvm.cpu()\n\nX_tvm = tvm.nd.array(X_np, device=dev)\nW_data_tvm = tvm.nd.array(W_sp_np.data, device=dev)\nW_indices_tvm = tvm.nd.array(W_sp_np.indices, device=dev)\nW_indptr_tvm = tvm.nd.array(W_sp_np.indptr, device=dev)\nB_tvm = tvm.nd.array(B_np, device=dev)\nY_tvm = tvm.nd.empty(Y_np.shape, device=dev)\n\nfunc(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, B_tvm, Y_tvm)\n\n# Check results\ntvm.testing.assert_allclose(Y_np, Y_tvm.numpy(), atol=1e-4, rtol=1e-4)\n\n# Evaluate execution time\nevaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)\nprint(\n    \"Execution time of this operator: %.3f ms\"\n    % (\n        np.median(evaluator(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, B_tvm, Y_tvm).results)\n        * 1000\n    )\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
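As a follow-up, you can turn the measured time into an effective throughput. The sparse matmul\nperforms roughly `2 * M * nnz` floating point operations, where `nnz = W_sp_np.data.size` is the\nnumber of stored elements; a sketch reusing the evaluator above (it ignores the relu and bias\nadd, so this is an estimate only):\n\n```python\nexec_s = np.median(\n    evaluator(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, B_tvm, Y_tvm).results\n)\nnnz = W_sp_np.data.size  # number of stored (nonzero-block) elements\nprint(\"Effective throughput: %.2f GFLOPS\" % (2.0 * M * nnz / exec_s / 1e9))\n```\n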

**Note**

Tuning result example\n\n```c\n----------------------------------------------------------------------\nLowered TIR:\nprimfn(placeholder_5: handle, placeholder_6: handle, placeholder_7: handle, placeholder_8: handle, placeholder_9: handle, compute_1: handle) -> ()\n attr = {\"global_symbol\": \"main\", \"tir.noalias\": True}\n buffers = {placeholder_2: Buffer(placeholder_10: Pointer(float32), float32, [9831, 16, 1], []),\n placeholder_4: Buffer(placeholder_11: Pointer(int32), int32, [33], []),\n placeholder_3: Buffer(placeholder_12: Pointer(float32), float32, [512, 512], []),\n compute: Buffer(compute_2: Pointer(float32), float32, [512, 512], []),\n placeholder_1: Buffer(placeholder_13: Pointer(float32), float32, [512, 512], []),\n placeholder: Buffer(placeholder_14: Pointer(int32), int32, [9831], [])}\n buffer_map = {placeholder_7: placeholder, placeholder_9: placeholder_1, placeholder_6: placeholder_2, compute_1: compute, placeholder_5: placeholder_3, placeholder_8: placeholder_4} {\n for (i0.outer.i1.outer.fused: int32, 0, 1024) \"parallel\" {\n attr [compute_3: Pointer(float32)] \"storage_scope\" = \"global\";\n allocate(compute_3, float32, [256]) {\n for (nb_j.inner: int32, 0, 2) {\n for (i.inner.init: int32, 0, 8) {\n for (j.init: int32, 0, 16) {\n compute_3[(((i.inner.init*32) + (nb_j.inner*16)) + j.init)] = 0f32\n }\n }\n for (elem_idx: int32, 0, ((int32*)placeholder_11[(((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) + 1)] - (int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)])) {\n for (i.inner: int32, 0, 8) {\n for (j: int32, 0, 16) {\n compute_3[(((i.inner*32) + (nb_j.inner*16)) + j)] = ((float32*)compute_3[(((i.inner*32) + (nb_j.inner*16)) + j)] + ((float32*)placeholder_10[((((int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)]*16) + (elem_idx*16)) + j)]*max((float32*)placeholder_12[(((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i.inner*512)) + (int32*)placeholder_14[((int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)] + elem_idx)])], 0f32)))\n }\n }\n }\n }\n for (i0.inner: int32, 0, 8) {\n compute_2[ramp((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32)), 1, 32)] = max(((float32x32*)compute_3[ramp((i0.inner*32), 1, 32)] + (float32x32*)placeholder_13[ramp((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32)), 1, 32)]), broadcast(0f32, 32))\n }\n }\n }\n}

\n```\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }