Relay Operator Strategy#

In order to lower Relay operators to the implementations defined in the TOPI library, a compute and a schedule function need to be registered for each Relay operator. However, compute and schedule functions are usually specialized for each target, and further, even for the same target, multiple algorithms and implementations may be available. To deal with this complexity, we introduce the operator strategy, which allows developers to define a flexible lowering strategy for each operator and target.

Operator Strategy Design#

The basic element of an operator strategy is an OpImplementation. It includes a pair of compute and schedule functions, the name of the implementation, and a priority level (the use of the priority level is explained in Select Implementation from Op Strategy below).

An OpStrategy includes a list of OpSpecializations. Each OpSpecialization contains a list of OpImplementations associated with a SpecializedCondition (see the definition in include/tvm/te/schedule.h). The SpecializedCondition can be null, indicating that the implementations are generally applicable; otherwise, the implementations should only be considered when the specialized condition is satisfied. A SpecializedCondition consists of a list of clauses over tensor expressions in conjunctive normal form (CNF), and only conditions on tensor shapes are supported.
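
For illustration, the snippet below is a minimal sketch (not taken from the TVM code base) of how a specialized condition scope is created from the Python side; the clauses in the list are combined conjunctively and may only constrain tensor shapes:

import tvm
from tvm import te

m = te.var("m")
n = te.var("n")
A = te.placeholder((m, n), name="A")

# Implementations added to a strategy inside this scope are associated with the
# condition "m > 16 and n % 4 == 0"; implementations added outside of any such
# scope are generally applicable.
with te.SpecializedCondition([A.shape[0] > 16, tvm.tir.indexmod(A.shape[1], 4) == 0]):
    pass  # strategy.add_implementation(...) calls would go here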

Finally, a strategy function, or FTVMStrategy, determines which pair of compute and schedule functions should be used for a given workload, and it needs to be registered to every Relay operator. FTVMStrategy is a generic function (see include/tvm/target/generic_func.h) that can be overwritten for each target. Its function signature is

OpStrategy(const Attrs& attrs, const Array<Tensor>& inputs, const Type& out_type, const Target& target)

The function returns the OpStrategy for the given op attributes, input tensors, output type, and the target to compile to.

Write a Strategy Function#

We recommend developers write strategy functions in Python, as most TOPI compute and schedule functions are written in Python. In Python, the OpStrategy class is provided in python/tvm/relay/op/op.py. It only has one API, which adds an implementation to the strategy:

def add_implementation(self, compute, schedule, name="default", plevel=10)

We use topk as an example to explain how to write the FTVMStrategy function:

# add to python/tvm/relay/op/strategy/generic.py
@override_native_generic_func("topk_strategy")
def topk_strategy(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_topk(topi.topk),
        wrap_topi_schedule(topi.generic.schedule_topk),
        name="topk.generic")
    return strategy

# add to each target file in python/tvm/relay/op/strategy, e.g., x86.py, cuda.py, etc.
@topk_strategy.register(["cuda", "gpu"])
def topk_strategy_cuda(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_topk(topi.cuda.topk),
        wrap_topi_schedule(topi.cuda.schedule_topk),
        name="topk.cuda")
    return strategy

In this example, we use topi.cuda.topk and topi.cuda.schedule_topk as the compute and schedule functions for the CUDA or GPU target, while using the TOPI generic compute and schedule for the remaining targets. Note that we use two wrapper functions that wrap the TOPI compute and schedule so that they conform to the required function signatures (see FTVMCompute and FTVMSchedule in include/tvm/relay/op_attr_types.h). Usually we need to write a customized compute wrapper function for each operator to extract the relevant fields from the op attributes.
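
For reference, such wrappers are thin closures that adapt the TOPI functions to the FTVMCompute/FTVMSchedule signatures. The sketch below is a simplified version (the actual wrappers in python/tvm/relay/op/strategy/generic.py handle a few more cases, such as a dynamic k):

def wrap_topi_schedule(topi_schedule):
    """Wrap a TOPI schedule so it matches the FTVMSchedule signature."""
    def wrapper(attrs, outs, target):
        with target:
            return topi_schedule(outs)
    return wrapper

def wrap_compute_topk(topi_compute):
    """Wrap the TOPI topk compute so it matches the FTVMCompute signature
    (simplified sketch: assumes a static k attribute)."""
    def _compute_topk(attrs, inputs, out_type):
        # extract the fields of TopKAttrs that the TOPI compute needs
        out = topi_compute(inputs[0], int(attrs.k), int(attrs.axis),
                           attrs.ret_type, bool(attrs.is_ascend), attrs.dtype)
        return out if isinstance(out, list) else [out]
    return _compute_topk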

The example above shows a very basic strategy function that only adds one implementation to the strategy. But for many complicated operators, we may need to add multiple implementations that use different algorithms. For example, we can use both the direct and winograd algorithms to compute a conv2d op. In order to achieve this, we can write the strategy function as follows:


strategy.add_implementation(
    wrap_compute_conv2d(topi.cuda.conv2d_nchw),
    wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw),
    name="conv2d_nchw.cuda",
    plevel=10)

if winograd_condition:
    strategy.add_implementation(
        wrap_compute_conv2d(topi.cuda.conv2d_nchw_winograd),
        wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw_winograd),
        name="conv2d_nchw_winograd.cuda",
        plevel=15)

In this example, we add two implementations to the conv2d strategy, where the winograd algorithm is only added when winograd_condition is true. The implementation "conv2d_nchw_winograd.cuda" will be used to compile conv2d when winograd_condition is true, as it has the higher priority level (this could change if one of the implementations is an AutoTVM template; see Select Implementation from Op Strategy for more details). Otherwise, "conv2d_nchw.cuda" is used.
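
winograd_condition itself is computed by the strategy function from the op attributes and input shapes. As a hedged illustration (a hypothetical helper, not the exact heuristic used by the CUDA strategy), such a check typically restricts winograd to small square kernels with unit stride and dilation:

def winograd_applicable(attrs, kernel):
    """Hypothetical check used to set winograd_condition in a conv2d strategy."""
    _, _, kh, kw = [int(x) for x in kernel.shape]   # assumes an OIHW kernel layout
    stride_h, stride_w = attrs.get_int_tuple("strides")
    dilation_h, dilation_w = attrs.get_int_tuple("dilation")
    return (
        2 < kh < 8 and kh == kw
        and stride_h == 1 and stride_w == 1
        and dilation_h == 1 and dilation_w == 1
    )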

We can extend the example above to a third party library implementation. For example, we can add an implementation that invokes the kernel from the cblas library when cblas is included in the target.


if "cblas" in target.libs:
    strategy.add_implementation(
        wrap_compute_dense(topi.x86.dense_cblas),
        wrap_topi_schedule(topi.x86.schedule_dense_cblas),
        name="dense_cblas.x86",
        plevel=15)
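
Whether "cblas" shows up in target.libs is decided by the target used for compilation; with the standard target syntax, an external library is requested through the -libs option, for example:

import tvm

# a target created with -libs=cblas lists "cblas" in target.libs, which enables
# the dense_cblas.x86 implementation above
target = tvm.target.Target("llvm -libs=cblas")
assert "cblas" in target.libs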

Further, we can add an implementation specialized for a certain range of shapes. The code below shows an example of a dense strategy that adds an implementation specialized for m greater than 16. The main difference between a hardcoded Python condition, as in the examples above, and a specialized condition is that the latter allows TVM to generate multiple kernels when the input tensors have symbolic shapes. The compile engine will generate a dispatch function that invokes the specialized kernel when the corresponding condition is met; otherwise, it invokes the kernel that has no specialized condition associated with it (dense_common in this example). This part is still a work in progress. More details will be provided after it is done.


def dense_strategy(attrs, inputs, out_type, target):
    m = inputs[0].shape[0]
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_dense(dense_compute1),
        wrap_topi_schedule(dense_schedule1),
        name="dense_common")

    with tvm.te.SpecializedCondition(m > 16):
        strategy.add_implementation(
            wrap_compute_dense(dense_compute2),
            wrap_topi_schedule(dense_schedule2),
            name="dense_for_large_m",
            plevel=15)

    return strategy

Register Strategy Function to an Operator#

After we define the strategy function for an operator, we can now register the strategy function to this operator with


register_strategy("topk", strategy.topk_strategy)
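
register_strategy is defined in python/tvm/relay/op/op.py; a typical per-op registration file imports it along with the strategy module, roughly as follows:

# e.g. in the registration file for the op under python/tvm/relay/op/
from . import strategy
from .op import register_strategy

register_strategy("topk", strategy.topk_strategy)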

However, it takes much effort to write a strategy function for an operator. Therefore, we provide two other methods for simpler operators.

First, for operators that have injective, broadcast, or reduction pattern, we can call register_injective_schedule, register_broadcast_schedule, and register_reduce_schedule respectively. The schedule functions for these patterns are already registered by each target and can be applied to these operators. We assume the compute function is the same across all targets, and FTVMCompute needs to be registered to the op before invoking the register schedule API.


register_broadcast_schedule("add")
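
As mentioned above, the target-independent compute still has to be registered via FTVMCompute before the schedule registration. A hedged sketch for a hypothetical elementwise operator "my_add" (assuming the Relay op itself is already defined) could look like:

from tvm import topi
from tvm.relay.op import op as _op

# "my_add" is a hypothetical op used only for illustration
def my_add_compute(attrs, inputs, out_type):
    # FTVMCompute signature: (attrs, inputs, out_type) -> list of output tensors
    return [topi.add(inputs[0], inputs[1])]

_op.register_compute("my_add", my_add_compute)
_op.register_broadcast_schedule("my_add")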

Second, for operators that don't have these common patterns mentioned above, but that still have the same compute function for all targets, we can use the register_schedule API. The FTVMSchedule function is easier to write, as we only need to provide which schedule function to use. The following code snippet shows the FTVMSchedule function for pooling.


# add to python/tvm/relay/op/strategy/generic.py
@generic_func
def schedule_pool(attrs, outs, target):
    with target:
        return topi.generic.schedule_pool(outs, attrs.layout)

# add to each target file in python/tvm/relay/op/strategy, e.g., x86.py, cuda.py, etc.
@schedule_pool.register("cpu")
def schedule_pool_cpu(attrs, outs, target):
    ...

After we have created the FTVMSchedule for an operator, we can register the strategy using register_schedule:


register_schedule("nn.max_pool2d", strategy.schedule_pool)

Register Strategies for a New Target#

There are two ways to register strategies for a new target. The more straightforward one is adding a new target file in the directory python/tvm/relay/op/strategy. You only need to customize the strategy for ops that have been implemented for this new target and reuse the generic strategies for the rest.

Alternatively, you can also register the strategy for the new target outside the TVM python library. The following code snippet shows an example of how to do so. You can find more examples in vta/python/vta/top/op.py.


@relay.op.strategy.conv2d_strategy.register("mytarget")
def conv2d_strategy_mytarget(attrs, inputs, out_type, target):
    ...

Select Implementation from Op Strategy#

During compilation, the Relay compile engine needs to determine which implementation to use for an operator when there is more than one. The selection policy works as follows.

When the input tensors to an operator or a fused op all have constant shapes, the compile engine first finds the best implementation based on the AutoTVM tuning logs. If there is no implementation that is an AutoTVM template, or all AutoTVM templates have fallback configs, the implementation with the highest priority level is chosen. Implementations with the same priority level in this case lead to undefined behavior, and any of them might be selected.
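
In practice the tuning logs are brought into scope with autotvm.apply_history_best before building, so that the compile engine can look up the tuned implementations (a usage sketch; the model and the log path are placeholders):

import tvm
from tvm import autotvm, relay

# placeholder model: a single conv2d
x = relay.var("x", shape=(1, 3, 224, 224))
w = relay.var("w", shape=(16, 3, 3, 3))
mod = tvm.IRModule.from_expr(relay.nn.conv2d(x, w))

# "tuning.log" is a placeholder for your AutoTVM tuning records
with autotvm.apply_history_best("tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm")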

The selection policy for ops with symbolic input shapes is still a work in progress. Currently, if any input tensor has a symbolic shape, only the implementation with the highest priority level will be used for this operator. This will be updated once the implementation is finished.

For debugging purposes, you can add the following lines before you compile the Relay model to learn which implementation is used for each operator.


import logging
import sys

logging.getLogger("te_compiler").setLevel(logging.INFO)
logging.getLogger("te_compiler").addHandler(logging.StreamHandler(sys.stdout))