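Compile Model Libraries

To run a model with MLC LLM on any platform, you need:

Model weights converted to MLC format (e.g. RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC).
A model library that contains the inference logic.

This page describes how to compile a model library with MLC LLM. Model compilation optimizes model inference for a given platform, and lets users bring their own model architectures, use different quantization modes, and customize the overall model optimization flow.

Notably, in many cases you do not need to invoke compilation explicitly.

If you are using the Python API, you can skip specifying model_lib and the system will JIT-compile the library.
If you are building an iOS/Android package, see Package Libraries and Weights, which provides a simpler high-level command that runs the compilation under the hood.

This page is still useful for understanding the compilation flow behind the scenes, or for creating model libraries explicitly. We use RedPajama-INCITE-Chat-3B-v1 with q4f16_1 as the running example and compile it for all platforms.

Note

Before you proceed, make sure you have followed Install TVM Unity Compiler. TVM Unity is the backend required to compile models with MLC LLM.

Please also follow the instructions in CLI / Python API to obtain the CLI app / Python API that can be used to chat with the compiled model.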

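0. Verify Installation

Step 1. Verify mlc_llm

We use the Python package mlc_llm to compile models. It is installed by following Install MLC LLM Python Package, either by building from source or by installing the prebuilt package. Verify the mlc_llm installation on the command line with: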

$ mlc_llm --help
# You should see help information with this line
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config}

Note

If you run into the error command not found: mlc_llm, try python -m mlc_llm --help.
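
The fallback described in this note can also be scripted. The following is a minimal sketch, not part of MLC LLM: the helper name resolve_mlc_llm_command is hypothetical, and it only decides which of the two invocations from this page to use.

```python
import shutil
import sys


def resolve_mlc_llm_command(which=shutil.which):
    """Return the argv prefix for invoking the mlc_llm CLI.

    Prefers the `mlc_llm` executable found on PATH and falls back to
    `python -m mlc_llm` when the executable is not found.
    """
    path = which("mlc_llm")
    if path is not None:
        return [path]
    return [sys.executable, "-m", "mlc_llm"]


if __name__ == "__main__":
    # Print the full help command that would be run on this machine.
    print(" ".join(resolve_mlc_llm_command() + ["--help"]))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.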

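Step 2. Verify TVM

To compile models, you also need to follow Install TVM Unity Compiler. Here we verify tvm quickly on the command line (for a full verification, see Validate TVM Installation):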

$ python -c "import tvm; print(tvm.__file__)"
/some-path/lib/python3.11/site-packages/tvm/__init__.py
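
The same check can be done without importing the package, which is useful when you only want to know whether it is installed. This is a small sketch using the standard library; the helper name find_package_file is an assumption made for illustration.

```python
import importlib.util


def find_package_file(name):
    """Return the file a top-level package would be loaded from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Report where tvm and mlc_llm would be imported from, if anywhere.
    for package in ("tvm", "mlc_llm"):
        print(package, "->", find_package_file(package))
```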

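1. Clone from HF and convert_weight

This step replicates Convert Model Weights; see that page for more details.

You can work under the mlc-llm repo or in your own working directory. Note that all platforms can share the same compiled/quantized weights.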

# Create directory
mkdir -p dist/models && cd dist/models
# Clone HF weights
git lfs install
git clone https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1
cd ../..
# Convert weight
mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
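
The output directory used above follows the pattern MODEL-QUANTIZATION-MLC. The sketch below reproduces that naming convention and assembles the convert_weight command as an argument list. The helper names are hypothetical, and the dist/ layout is the one used on this page rather than a requirement of the tool.

```python
from pathlib import PurePosixPath


def mlc_weight_dir(hf_model_dir, quantization, dist_root="dist"):
    """Build the output directory name used on this page: <dist_root>/<model>-<quantization>-MLC."""
    model_name = PurePosixPath(hf_model_dir).name
    return f"{dist_root}/{model_name}-{quantization}-MLC"


def convert_weight_argv(hf_model_dir, quantization):
    """Assemble the convert_weight command shown above as an argv list."""
    return [
        "mlc_llm", "convert_weight", hf_model_dir,
        "--quantization", quantization,
        "-o", mlc_weight_dir(hf_model_dir, quantization),
    ]


if __name__ == "__main__":
    print(" ".join(convert_weight_argv("./dist/models/RedPajama-INCITE-Chat-3B-v1/", "q4f16_1")))
```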

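2. Generate mlc-chat-config and compile

A model library is specified by:

The model architecture (e.g. llama-2, gpt-neox)

Quantization (e.g. q4f16_1, q0f32)

Metadata (e.g. context_window_size, sliding_window_size, prefill_chunk_size), which affects memory planning

Platform (e.g. cuda, webgpu, iOS)

All these options are specified in the mlc-chat-config.json file generated by gen_config. Each pair of commands below targets one platform; the --device flag of the compile command identifies it.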

# Create output directory for the model library compiled
mkdir dist/libs

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so

For M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so

Cross-compiling for Intel Mac on M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device metal:x86-64 -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib

For Intel Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib

For Linux:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so

For Windows:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.dll

For iOS/iPadOS (--device iphone), you need a Mac to compile the model.

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
    --conv-template redpajama_chat --context-window-size 768 \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device iphone -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar

Note

If you run into the error:

Compilation error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH
xcrun: error: unable to find utility "metallib", not a developer tool or in PATH

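please check that Command Line Tools for Xcode is installed correctly. You can verify this with xcrun metal: when it prints metal: error: no input files, Command Line Tools for Xcode is installed and can be found, and you can proceed with the model compilation.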

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
    --conv-template redpajama_chat --context-window-size 768 \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device android -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device webgpu -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm

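Note

To compile for webgpu, you need to build mlc_llm from source when installing it. In addition, you need to follow Install Wasm Build Environment. Otherwise, you may run into the error: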

RuntimeError: Cannot find libraries: wasm_runtime.bc

Note

For webgpu, when compiling larger models such as Llama-2-7B, you may want to add --prefill-chunk-size 1024 or lower --context-window-size to reduce memory usage. Otherwise, you may run into issues like:

TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

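Note

For the conv-template, conversation_template.py contains the full list of conversation templates that MLC provides. If the model you are adding requires a new conversation template, you need to add your own. You can follow this PR as an example. However, adding your own template requires building mlc_llm from source so that the runtime can recognize it.

For more details, see Customize MLC Chat Config.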
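
Because the per-platform commands above differ only in the --device value, the library suffix, and a few optional gen_config flags, they can be generated from a small table. The following is a sketch under that assumption: the helper build_commands and the LIB_SUFFIX table are hypothetical, and the table lists only the device/suffix pairs that appear on this page (cross-compilation targets and the Windows .dll variant are left out).

```python
import shlex

# Library file suffix used in this page's examples for each --device value.
LIB_SUFFIX = {
    "cuda": "cuda.so",
    "metal": "metal.so",
    "vulkan": "vulkan.so",
    "iphone": "iphone.tar",
    "android": "android.tar",
    "webgpu": "webgpu.wasm",
}


def build_commands(model, quantization, conv_template, device, extra_gen_config_args=()):
    """Return (gen_config argv, compile argv) following the layout used on this page."""
    mlc_dir = f"dist/{model}-{quantization}-MLC"
    gen_config = [
        "mlc_llm", "gen_config", f"./dist/models/{model}/",
        "--quantization", quantization,
        "--conv-template", conv_template,
        *extra_gen_config_args,
        "-o", f"{mlc_dir}/",
    ]
    compile_cmd = [
        "mlc_llm", "compile", f"./{mlc_dir}/mlc-chat-config.json",
        "--device", device,
        "-o", f"dist/libs/{model}-{quantization}-{LIB_SUFFIX[device]}",
    ]
    return gen_config, compile_cmd


if __name__ == "__main__":
    # Print the two commands for the iOS example, including its smaller context window.
    for argv in build_commands(
        "RedPajama-INCITE-Chat-3B-v1", "q4f16_1", "redpajama_chat", "iphone",
        extra_gen_config_args=("--context-window-size", "768"),
    ):
        print(shlex.join(argv))
```

Building argument lists rather than shell strings keeps paths with spaces safe if the commands are later passed to subprocess.run.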

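3. Verify output and chat

Running the compile commands above produces the model weights, the model library, and a chat config. You can check the output with the commands below: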

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so      # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

You can now chat with the model using the command line interface (CLI) app or the Python API.

python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so")
>>> engine.chat.completions.create(
...   messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
  choices=[ChatCompletionResponseChoice(
    message=ChatCompletionMessage(
      content="Hi! How can I assist you today?", role='assistant'
    )
  )],
  ...
)

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so     # ===> the model library (will be -metal_x86_64.dylib for Intel Mac)

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

You can now chat with the model using the command line interface (CLI) app or the Python API.

python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so")
>>> engine.chat.completions.create(
...   messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
  choices=[ChatCompletionResponseChoice(
    message=ChatCompletionMessage(
      content="Hi! How can I assist you today?", role='assistant'
    )
  )],
  ...
)

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so    # ===> the model library (will be .dll for Windows)

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

You can now chat with the model using the command line interface (CLI) app or the Python API.

python
>>> from mlc_llm import MLCEngine
>>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so")
>>> engine.chat.completions.create(
...   messages=[{"role": "user", "content": "hello"}]
... )
ChatCompletionResponse(
  choices=[ChatCompletionResponseChoice(
    message=ChatCompletionMessage(
      content="Hi! How can I assist you today?", role='assistant'
    )
  )],
  ...
)

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar   # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

The model library dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar will be packaged into the iOS app as a static library. For more details, see iOS Swift SDK.

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar  # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

The model library dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar will be packaged into the Android app as a static library. For more details, see Android SDK.

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm  # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json

To use this model in the WebGPU runtime, see WebLLM Javascript SDK.
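
The directory listings in this section can be checked programmatically. The sketch below looks for the chat config, the weight index, and at least one weight shard. The helper name missing_outputs is hypothetical, and tokenizer files are deliberately not checked here.

```python
import os
import tempfile
from pathlib import Path


def missing_outputs(model_dir):
    """Return the expected MLC outputs that are absent from model_dir."""
    missing = [
        name
        for name in ("mlc-chat-config.json", "ndarray-cache.json")
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
    # At least one weight shard must exist.
    if not any(Path(model_dir).glob("params_shard_*.bin")):
        missing.append("params_shard_*.bin")
    return missing


if __name__ == "__main__":
    # Demonstrate on an empty temporary directory: everything is reported missing.
    with tempfile.TemporaryDirectory() as tmp:
        print(missing_outputs(tmp))
```

In practice you would call missing_outputs("dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC") after running convert_weight and gen_config; an empty list means all three kinds of file were found.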

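Compile Commands for More Models

This section lists compile commands for more models that you can try out. Note that they generalize easily to any model variant, as long as mlc-llm supports the architecture.

Please request access to the Llama-2 weights from Meta first. Once access is granted, create the directory dist/models and download the model into it. For example, you can run the following code: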

mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
cd ../..

Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.

mlc_llm convert_weight ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC

Afterwards, run the following commands to generate the MLC config and compile the model.

# Create output directory for the model library compiled
mkdir dist/libs

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so

For M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal.so

Cross-compiling for Intel Mac on M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device metal:x86-64 -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib

For Intel Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib

For Linux:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.so

For Windows:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --context-window-size 2048 --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device webgpu -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-webgpu.wasm

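Note

To compile for webgpu, you need to build mlc_llm from source when installing it. In addition, you need to follow Install Wasm Build Environment. Otherwise, you may run into the error: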

RuntimeError: Cannot find libraries: wasm_runtime.bc

For iOS/iPadOS (--device iphone), you need a Mac to compile the model.

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device iphone -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-iphone.tar

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
    --conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device android -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-android.tar

Note that Mistral uses sliding window attention (SWA). Thus, instead of specifying context-window-size, we specify sliding-window-size.
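
To see why a window size replaces the context size for such models, the sketch below lists which positions a token can attend to under causal sliding window attention. It is a generic illustration and not MLC LLM's implementation; it assumes the convention in which the window counts the current token, and the function name is hypothetical.

```python
def visible_positions(query_pos, sliding_window_size):
    """Positions a token at query_pos may attend to under causal sliding window attention.

    Assumes the window includes the current token, so at most
    sliding_window_size positions are ever visible.
    """
    start = max(0, query_pos - sliding_window_size + 1)
    return range(start, query_pos + 1)


if __name__ == "__main__":
    for pos in (2, 10):
        print(pos, list(visible_positions(pos, 4)))
```

Under this convention the number of visible positions never exceeds the window size, no matter how long the sequence grows, which is why the window size rather than the context length is the quantity to configure.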

First create the directory dist/models and download the model into it. For example, you can run the following code:

mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
cd ../..

Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.

mlc_llm convert_weight ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC

Afterwards, run the following commands to generate the MLC config and compile the model.

# Create output directory for the model library compiled
mkdir dist/libs

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-cuda.so

For M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so

For Intel Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal_x86_64.dylib

For Linux:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.so

For Windows:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.dll

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --prefill-chunk-size 1024 --conv-template mistral_default \
    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device webgpu -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-webgpu.wasm

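Note

To compile for webgpu, you need to build mlc_llm from source when installing it. In addition, you need to follow Install Wasm Build Environment. Otherwise, you may run into the error: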

RuntimeError: Cannot find libraries: wasm_runtime.bc

Note

For webgpu, when compiling larger models such as Llama-2-7B, you may want to add --prefill-chunk-size 1024 or lower --context-window-size to reduce memory usage. Otherwise, you may run into issues like:

TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

For iOS/iPadOS (--device iphone), you need a Mac to compile the model.

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128  \
    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device iphone -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-iphone.tar

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
    --conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128 \
    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
    --device android -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-android.tar

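For any other supported model, substitute DISTRIBUTOR, HF_MODEL, OUTPUT, and CONV_TEMPLATE in the commands below. First create the directory dist/models and download the model into it. For example, you can run the following code: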

mkdir -p dist/models && cd dist/models
git lfs install
git clone https://huggingface.co/DISTRIBUTOR/HF_MODEL
cd ../..

Then convert the HF weights into MLC-compatible weights. Note that all platforms can share the same compiled/quantized weights.

mlc_llm convert_weight ./dist/models/HF_MODEL/ --quantization q4f16_1 -o dist/OUTPUT-MLC

Afterwards, run the following commands to generate the MLC config and compile the model.

# Create output directory for the model library compiled
mkdir dist/libs

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device cuda -o dist/libs/OUTPUT-cuda.so

For M-chip Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal.so

For Intel Mac:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal_x86_64.dylib

For Linux:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.so

For Windows:

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.dll

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device webgpu -o dist/libs/OUTPUT-webgpu.wasm

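Note

To compile for webgpu, you need to build mlc_llm from source when installing it. In addition, you need to follow Install Wasm Build Environment. Otherwise, you may run into the error: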

RuntimeError: Cannot find libraries: wasm_runtime.bc

Note

For webgpu, when compiling larger models such as Llama-2-7B, you may want to add --prefill-chunk-size 1024 or lower --context-window-size to reduce memory usage. Otherwise, you may run into issues like:

TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

For iOS/iPadOS (--device iphone), you need a Mac to compile the model.

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
    --context-window-size 768 -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device iphone -o dist/libs/OUTPUT-iphone.tar

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
    --context-window-size 768 -o dist/OUTPUT-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device android -o dist/libs/OUTPUT-android.tar

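For each model and each backend, the above only provides the most recommended build command, which is the most optimized one. You can also try different argument values (e.g. a different quantization mode or context window size). These choices affect the runtime memory requirement, and the resulting builds may not run as fast or as robustly as the provided ones.

Note

3-bit quantization is often too aggressive and only works in limited settings. If the compiled model does not perform as expected, consider using a higher number of bits for quantization (e.g. 4-bit quantization).

If you are interested in distributing the model in addition to running it locally, see (Optional) 3. Upload weights to HF.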

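Compile Command Specification

As the sections above show, model compilation is split into three steps: convert weights, generate mlc-chat-config.json, and compile the model. This section describes the options available in each step.

1. Convert Weight

The weight conversion command follows the pattern below: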

mlc_llm convert_weight \
    CONFIG \
    --quantization QUANTIZATION_MODE \
    [--model-type MODEL_TYPE] \
    [--device DEVICE] \
    [--source SOURCE] \
    [--source-format SOURCE_FORMAT] \
    --output OUTPUT

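Note that CONFIG is a positional argument. Arguments wrapped in [ ] are optional.

CONFIG

It can be one of the following:

Path to a HuggingFace model directory that contains a config.json, or
Path to a config.json in HuggingFace format, or
The name of a predefined model architecture.

A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, the number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

A HuggingFace directory usually contains a config.json that defines the model architecture, the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations, and an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

For the existing predefined model architectures, see MODEL_PRESETS.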

--quantization QUANTIZATION_MODE

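The quantization mode used for compilation.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as text generated by 3-bit quantized models may be of poor quality, depending on the model.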
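
As a rough guide to what these modes mean for disk and memory, the sketch below parses a mode name and estimates the weight size. It assumes the names follow the qAfB(_id) convention, with A the number of bits per weight and B the floating-point width, and that A = 0 means the weights are stored unquantized at B bits. The estimate ignores quantization scales and any layers kept at higher precision, so treat it as a lower bound. Both helper names are hypothetical.

```python
import re

# q<weight bits>f<float bits>, optionally followed by _<variant id>.
_QUANT_PATTERN = re.compile(r"^q(\d+)f(\d+)(?:_(\w+))?$")


def parse_quantization(name):
    """Split a mode name such as 'q4f16_1' into (weight_bits, float_bits, variant)."""
    match = _QUANT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized quantization mode: {name!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def approx_weight_bytes(num_params, name):
    """Back-of-envelope weight size: parameters times bits per weight, divided by 8."""
    weight_bits, float_bits, _ = parse_quantization(name)
    bits = weight_bits if weight_bits > 0 else float_bits
    return num_params * bits / 8


if __name__ == "__main__":
    # Compare modes for a hypothetical 3-billion-parameter model.
    for mode in ("q0f16", "q4f16_1", "q3f16_1"):
        print(mode, round(approx_weight_bytes(3_000_000_000, mode) / 1e9, 3), "GB")
```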

--model-type MODEL_TYPE

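The model architecture, such as "llama". If not set, it is inferred from config.json.

--device DEVICE

The device used for quantization, such as "cuda" or "cuda:0". If not specified, it is detected from the GPUs available locally.

--source SOURCE

The path to the original model weights. If missing, it is inferred from config.

--source-format SOURCE_FORMAT

The format of the source model weights. If missing, it is inferred from config.

--output OUTPUT

The output directory for the quantized model weights. params_shard_*.bin and ndarray-cache.json are created in this directory.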

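2. Generate MLC Chat Config

To compile a model, we first need to generate mlc-chat-config.json. This file contains specifications such as context-window-size and sliding-window-size, along with other options that can alter the compiled model. Tokenizers are also processed in this step.

The config generation command follows the pattern below: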

mlc_llm gen_config \
    CONFIG \
    --quantization QUANTIZATION_MODE \
    [--model-type MODEL_TYPE] \
    --conv-template CONV_TEMPLATE \
    [--context-window-size CONTEXT_WINDOW_SIZE] \
    [--sliding-window-size SLIDING_WINDOW_SIZE] \
    [--prefill-chunk-size PREFILL_CHUNK_SIZE] \
    [--tensor-parallel-shard TENSOR_PARALLEL_SHARDS] \
    --output OUTPUT

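Note that CONFIG is a positional argument. Arguments wrapped in [ ] are optional.

CONFIG

It can be one of the following:

Path to a HuggingFace model directory that contains a config.json, or
Path to a config.json in HuggingFace format, or
The name of a predefined model architecture.

A config.json file in HuggingFace format defines the model architecture, including the vocabulary size, the number of layers, the hidden size, the number of attention heads, etc. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

A HuggingFace directory usually contains a config.json that defines the model architecture, the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations, and an optional generation_config.json that provides additional default configuration for text generation. Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

For the existing predefined model architectures, see MODEL_PRESETS.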

--quantization QUANTIZATION_MODE

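The quantization mode used for compilation.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as text generated by 3-bit quantized models may be of poor quality, depending on the model.

--model-type MODEL_TYPE

The model architecture, such as "llama". If not set, it is inferred from config.json.

--conv-template CONV_TEMPLATE

The conversation template. It depends on how the model was tuned. Use "LM" for a vanilla base model. For the existing predefined templates, see CONV_TEMPLATES.

--context-window-size CONTEXT_WINDOW_SIZE

The maximum sequence length supported by the model. This is usually stated explicitly as the context length or context window in the model card. If this option is not set, it is determined by context_window_size or max_position_embeddings in config.json; the latter can be inaccurate for some models.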
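
The context window size matters for memory because the key/value cache grows linearly with it. The sketch below applies the standard textbook formula for KV cache size; it is not MLC LLM's memory planner. The shape used in the example (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is an assumption corresponding to a Llama-2-7B-style model with an f16 cache.

```python
def kv_cache_bytes(context_window_size, num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size for one sequence.

    Each layer stores one key and one value vector (hence the factor 2)
    per KV head for every token in the context window.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_window_size


if __name__ == "__main__":
    # Compare the mobile setting used on this page (768) with a 4096-token window.
    for window in (768, 4096):
        size = kv_cache_bytes(window, num_layers=32, num_kv_heads=32, head_dim=128)
        print(window, size / 2**20, "MiB")
```

With these assumed shapes each token costs 0.5 MiB of cache, so a 768-token window needs 384 MiB while a 4096-token window needs 2 GiB. This is consistent with the smaller --context-window-size used for the iOS and Android examples.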

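--sliding-window-size SLIDING_WINDOW_SIZE

(Experimental) The sliding window size in sliding window attention (SWA). This optional field overrides sliding_window in config.json for models that use SWA. Currently it is only useful when compiling Mistral-based models. This flag is subject to future refactoring.

--prefill-chunk-size PREFILL_CHUNK_SIZE

(Experimental) The chunk size during prefilling. By default, the chunk size is the same as context_window_size or sliding_window_size. This flag is subject to future refactoring.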
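
Conceptually, a prefill chunk size means a long prompt is processed in pieces of at most that many tokens. The sketch below shows only this splitting step as a generic illustration; it makes no claim about how MLC LLM schedules the chunks, and the function name is hypothetical.

```python
def prefill_chunks(token_ids, prefill_chunk_size):
    """Split a prompt's token ids into consecutive chunks of at most prefill_chunk_size."""
    if prefill_chunk_size <= 0:
        raise ValueError("prefill_chunk_size must be positive")
    return [
        token_ids[start:start + prefill_chunk_size]
        for start in range(0, len(token_ids), prefill_chunk_size)
    ]


if __name__ == "__main__":
    # A 300-token prompt with a chunk size of 128 is processed in three pieces.
    print([len(chunk) for chunk in prefill_chunks(list(range(300)), 128)])
```

A smaller chunk size lowers the peak size of the intermediate buffers needed during prefill, at the cost of more passes over the model.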

--tensor-parallel-shard TENSOR_PARALLEL_SHARDS

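The number of shards to split the model into for tensor-parallel multi-GPU inference.

--output OUTPUT

The output directory for the generated configuration, including mlc-chat-config.json and the tokenizer configuration.

3. Compile Model Library

After generating mlc-chat-config.json, we can compile the model into a model library (a file ending in .so, .tar, etc. that contains the inference logic of the model).

The model compilation command follows the pattern below: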

mlc_llm compile \
    MODEL \
    [--quantization QUANTIZATION_MODE] \
    [--model-type MODEL_TYPE] \
    [--device DEVICE] \
    [--host HOST] \
    [--opt OPT] \
    [--system-lib-prefix SYSTEM_LIB_PREFIX] \
    --output OUTPUT \
    [--overrides OVERRIDES]

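Note that MODEL is a positional argument. Arguments wrapped in [ ] are optional.

MODEL

A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json.

--quantization QUANTIZATION_MODE

The quantization mode used for compilation. If not provided, it is inferred from MODEL.

See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.

We encourage you to use 4-bit quantization, as text generated by 3-bit quantized models may be of poor quality, depending on the model.

--model-type MODEL_TYPE

The model architecture, such as "llama". If not set, it is inferred from mlc-chat-config.json.

--device DEVICE

The GPU device to compile the model for. If not set, it is inferred from the GPUs available locally.

--host HOST

The host LLVM triple to compile the model for. If not set, it is inferred from the local CPU and OS. Examples of LLVM triples: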

iPhones: arm64-apple-ios;
ARM64 Android phones: aarch64-linux-android;
WebAssembly: wasm32-unknown-unknown-wasm;
Windows: x86_64-pc-windows-msvc;
ARM macOS: arm64-apple-darwin.

--opt OPT

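Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted O0, O1, O2, and O3, where O0 means no optimization, O2 enables most optimizations, and O3 enables extreme optimizations that could potentially break the system.

Optimization flags can also be specified explicitly through individual knobs, e.g. --opt="cutlass_attn=1;cutlass_norm=0;cublas_gemm=0;cudagraph=0".

--system-lib-prefix SYSTEM_LIB_PREFIX

Adds a prefix to all exported symbols, similar to objcopy --prefix-symbols. This is useful when compiling multiple models into a single library, to avoid symbol conflicts. Unlike objcopy, this has no effect on shared libraries.

--output OUTPUT

The path to the output file. The suffix determines whether the output file is a shared library or objects. Available suffixes: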

Linux: .so (shared), .tar (objects);
macOS: .dylib (shared), .tar (objects);
Windows: .dll (shared), .tar (objects);
Android, iOS: .tar (objects);
Web: .wasm (web assembly).
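
The suffix rules above amount to a small lookup table. The sketch below encodes exactly the suffixes listed and reports which kind of output a given path would produce. The helper name output_kind is hypothetical, and the sketch does not check whether a suffix is valid for the platform you compile on.

```python
import os

# Suffix -> kind of output, as listed above.
OUTPUT_KINDS = {
    ".so": "shared",
    ".dylib": "shared",
    ".dll": "shared",
    ".tar": "objects",
    ".wasm": "web assembly",
}


def output_kind(path):
    """Return the kind of output implied by the file suffix of path."""
    suffix = os.path.splitext(path)[1]
    try:
        return OUTPUT_KINDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output suffix: {suffix!r}") from None


if __name__ == "__main__":
    for name in ("model-cuda.so", "model-iphone.tar", "model-webgpu.wasm"):
        print(name, "->", output_kind(name))
```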

--overrides OVERRIDES

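Model configuration overrides. These override the configuration in mlc-chat-config.json. Supported keys are context_window_size, prefill_chunk_size, sliding_window, max_batch_size, and tensor_parallel_shards. Individual values are given as key=value pairs separated by semicolons, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128".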
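
Both --overrides and the detailed form of --opt use the same "key=value;key=value" string format. The sketch below parses such a string into a dictionary, which can help when validating values in a build script before calling the CLI. It illustrates the string format only and is not MLC LLM's own parser; it assumes integer values, as in the examples on this page.

```python
def parse_knobs(text):
    """Parse 'key=value;key=value' into a dict with integer values."""
    result = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # Tolerate a trailing semicolon.
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key or not value.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        result[key] = int(value)
    return result


if __name__ == "__main__":
    print(parse_knobs("context_window_size=1024;prefill_chunk_size=128"))
```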
