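WebLLM JavaScript SDK

WebLLM is a high-performance in-browser LLM inference engine that aims to be the backend of AI-powered web applications and agents.

It provides a specialized runtime for the web backend of MLCEngine, leverages WebGPU for local acceleration, offers an OpenAI-compatible API, and has built-in support for Web Workers to separate heavy computation from the UI flow.

Please check out the WebLLM repo for how to use WebLLM to build web applications in JavaScript/TypeScript. Here we only provide a high-level idea and discuss how to use MLC-LLM to compile your own model to run with WebLLM.

Getting Started

To get started, try out WebLLM Chat, which provides a great example of integrating WebLLM into a full web application.

A WebGPU-compatible browser is needed to run WebLLM-powered web applications. You can download the latest Google Chrome and use WebGPU Report to verify the functionality of WebGPU on your browser.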
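
Before loading a model, an application may want to check programmatically that the browser exposes WebGPU at all. The following is a minimal sketch, not part of WebLLM: the helper name exposesWebGPU is hypothetical, and it only tests for the presence of the standard navigator.gpu entry point, which does not guarantee that a usable adapter exists.

```typescript
// Hypothetical helper (not a WebLLM API): reports whether a navigator-like
// object exposes the WebGPU entry point `gpu`.
// Presence of `gpu` is necessary but not sufficient; a real application would
// still need to request an adapter and handle the case where none is returned.
function exposesWebGPU(nav: object): boolean {
  if (!("gpu" in nav)) {
    return false;
  }
  const gpu = (nav as { gpu?: unknown }).gpu;
  return gpu !== undefined && gpu !== null;
}
```

In a browser this would be called as exposesWebGPU(navigator); a false result is a signal to show a fallback message instead of trying to create an engine.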

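WebLLM is available as an npm package and is also delivered via CDN. Try a simple chatbot in this JSFiddle example, with no setup required.

You can also check out the existing examples for more advanced usage of WebLLM, such as JSON mode, streaming, and more.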
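
To make the streaming case concrete, the sketch below folds a sequence of streamed chunks into one reply string. It assumes the chunks follow the OpenAI streaming shape (choices[0].delta.content), which is what an OpenAI-compatible API would be expected to return; the ChunkLike type and the collectReply helper are hypothetical and defined locally, not taken from WebLLM.

```typescript
// Assumed chunk shape, modeled on the OpenAI streaming format.
// Only the fields this sketch reads are declared.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Hypothetical helper: concatenates the text deltas of all chunks.
// Chunks without choices or without delta content contribute nothing.
function collectReply(chunks: ChunkLike[]): string {
  let reply = "";
  for (const chunk of chunks) {
    reply += chunk.choices[0]?.delta?.content ?? "";
  }
  return reply;
}
```

With a real stream, the same accumulation would typically happen inside a loop over the chunks as they arrive, so the UI can be updated after each delta.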

Model Records in WebLLM

Each model in WebLLM Chat is registered as an instance of ModelRecord and can be accessed at webllm.prebuiltAppConfig.model_list.
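
Since model_list is a plain list of records, looking up a model by its model_id is a simple scan. The sketch below illustrates this with a locally defined ModelRecordLike type that mirrors only the three fields used on this page (model, model_id, model_lib); both the type and the findModelRecord helper are hypothetical, and the URLs in the usage note are placeholders.

```typescript
// Local stand-in for the fields of a model record used on this page.
interface ModelRecordLike {
  model: string;     // URL of the converted weights
  model_id: string;  // identifier used to select the model
  model_lib: string; // URL of the compiled model library (.wasm)
}

// Hypothetical helper: returns the first record with the given model_id,
// or undefined when the list does not contain it.
function findModelRecord(
  list: ModelRecordLike[],
  modelId: string,
): ModelRecordLike | undefined {
  for (const record of list) {
    if (record.model_id === modelId) {
      return record;
    }
  }
  return undefined;
}
```

Against the real prebuilt list, the call would presumably look like findModelRecord(webllm.prebuiltAppConfig.model_list, "Llama-3-8B-Instruct-q4f32_1-MLC"), assuming the records are structurally compatible with ModelRecordLike.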

Looking at the most straightforward example, get-started, there are two ways to run a model.

We can use a prebuilt model by simply passing its model_id to CreateMLCEngine():

const selectedModel = "Llama-3-8B-Instruct-q4f32_1-MLC";
const engine = await webllm.CreateMLCEngine(selectedModel);

Or we can specify our own model to run by creating a model record:

const appConfig: webllm.AppConfig = {
  model_list: [
    {
      model: "https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f32_1-MLC",
      model_id: "Llama-3-8B-Instruct-q4f32_1-MLC",
      model_lib:
        webllm.modelLibURLPrefix +
        webllm.modelVersion +
        "/Llama-3-8B-Instruct-q4f32_1-ctx4k_cs1k-webgpu.wasm",
    },
    // Add your own models here...
  ],
};
const selectedModel = "Llama-3-8B-Instruct-q4f32_1-MLC";
const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
  selectedModel,
  { appConfig: appConfig },
);

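Looking at the code above, we find that, just like on any other platform supported by MLC-LLM, to run a model on WebLLM you need:

Model weights converted to MLC format (e.g. Llama-3-8B-Instruct-q4f32_1-MLC): downloaded through the URL in ModelRecord.model.
Model library that comprises the inference logic (see the repo binary-mlc-llm-libs): downloaded through the URL in ModelRecord.model_lib.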
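
Because both artifacts are fetched by URL, a record with a missing or malformed URL will only fail once the engine tries to download it. The following sketch checks a candidate record up front. The helper name missingArtifacts is hypothetical, and the rule that each URL must start with https:// is an assumption made for this illustration, not a documented WebLLM requirement.

```typescript
// Hypothetical pre-flight check for the two URLs a record must carry.
// Returns the names of the fields that are absent or do not look like
// https URLs (the https requirement is an assumption of this sketch).
function missingArtifacts(record: { model?: string; model_lib?: string }): string[] {
  const looksLikeUrl = (value: string | undefined): boolean =>
    typeof value === "string" && /^https:\/\/\S+$/.test(value);

  const problems: string[] = [];
  if (!looksLikeUrl(record.model)) {
    problems.push("model"); // weights location
  }
  if (!looksLikeUrl(record.model_lib)) {
    problems.push("model_lib"); // compiled library location
  }
  return problems;
}
```

An empty result means both the weights URL and the library URL are present; it does not verify that the files behind them exist.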

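In the sections below, we walk you through two examples of how to add your own model besides the ones in webllm.prebuiltAppConfig.model_list. Before proceeding, please verify that mlc_llm and tvm are installed.

Verify Installation for Adding Models

Step 1. Verify mlc_llm

We use the Python package mlc_llm to compile models. It can be installed by following Install MLC LLM Python Package, either by building from source or by installing the prebuilt package. Verify the mlc_llm installation on the command line via: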

$ mlc_llm --help
# You should see help information with this line
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config}

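Note

If you run into the error command not found: mlc_llm, try python -m mlc_llm --help.

Step 2. Verify TVM

To compile models, you also need to follow Install TVM Unity Compiler. Here we quickly verify tvm from the command line (for full verification, see Validate TVM Installation):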

$ python -c "import tvm; print(tvm.__file__)"
/some-path/lib/python3.11/site-packages/tvm/__init__.py

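Bring Your Own Model Variant

In cases where the model you are adding is simply a variant of an existing model, you only need to convert the weights and can reuse an existing model library. For instance:

Adding OpenMistral when MLC supports Mistral
Adding a Llama3 fine-tuned on a domain-specific task when MLC supports Llama3

In this section, we walk you through adding WizardMath-7B-V1.1-q4f16_1 to the get-started example. According to the config.json in its Hugging Face repo, it reuses the Mistral model architecture.

Note

This section is largely replicated from Convert Model Weights. See that page for more details. Note that weights are shared across all platforms in MLC.

Step 1. Clone from HF and convert_weight

You can work under the mlc-llm repo or in your own working directory. Note that all platforms can share the same compiled/quantized weights. See Compile Command Specification for the specification of convert_weight.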

# Create directory
mkdir -p dist/models && cd dist/models
# Clone HF weights
git lfs install
git clone https://huggingface.co/WizardLM/WizardMath-7B-V1.1
cd ../..
# Convert weight
mlc_llm convert_weight ./dist/models/WizardMath-7B-V1.1/ \
    --quantization q4f16_1 \
    -o dist/WizardMath-7B-V1.1-q4f16_1-MLC

Step 2. Generate MLC Chat Config

Use mlc_llm gen_config to generate mlc-chat-config.json and process the tokenizers. See Compile Command Specification for the specification of gen_config.

mlc_llm gen_config ./dist/models/WizardMath-7B-V1.1/ \
    --quantization q4f16_1 --conv-template wizard_coder_or_math \
    -o dist/WizardMath-7B-V1.1-q4f16_1-MLC/

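For the conv-template, conversation_template.py contains the full list of conversation templates that MLC provides. You can also manually modify mlc-chat-config.json to add a customized conversation template.

Step 3. Upload weights to HF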

# First, please create a repository on Hugging Face.
# With the repository created, run
git lfs install
git clone https://huggingface.co/my-huggingface-account/my-wizardMath-weight-huggingface-repo
cd my-wizardMath-weight-huggingface-repo
cp path/to/mlc-llm/dist/WizardMath-7B-V1.1-q4f16_1-MLC/* .
git add . && git commit -m "Add wizardMath model weights"
git push origin main

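After successfully following all the steps, you should end up with a Hugging Face repo similar to WizardMath-7B-V1.1-q4f16_1-MLC, which includes the converted/quantized weights, the mlc-chat-config.json, and the tokenizer files.

Step 4. Register as a ModelRecord

Finally, we modify the get-started code snippet pasted above.

We simply specify the Hugging Face link as model, while reusing the model_lib for Mistral-7B.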

const appConfig: webllm.AppConfig = {
  model_list: [
    {
      model: "https://huggingface.co/mlc-ai/WizardMath-7B-V1.1-q4f16_1-MLC",
      model_id: "WizardMath-7B-V1.1-q4f16_1-MLC",
      model_lib:
        webllm.modelLibURLPrefix +
        webllm.modelVersion +
        "/Mistral-7B-Instruct-v0.3-q4f16_1-ctx4k_cs1k-webgpu.wasm",
    },
    // Add your own models here...
  ],
};

// Must match the model_id registered in appConfig above.
const selectedModel = "WizardMath-7B-V1.1-q4f16_1-MLC";
const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
  selectedModel,
  { appConfig: appConfig },
);
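
The pattern above, new weights with an existing library, can be captured in a small helper that derives a variant record from a base record. This is only a sketch: ModelRecordLike and makeVariantRecord are hypothetical names defined locally, and the record type mirrors just the three fields used on this page.

```typescript
// Local stand-in for the fields of a model record used on this page.
interface ModelRecordLike {
  model: string;     // URL of the converted weights
  model_id: string;  // identifier used to select the model
  model_lib: string; // URL of the compiled model library (.wasm)
}

// Hypothetical helper: builds a record for a model variant that points at
// its own weights but reuses the compiled library of the base record.
function makeVariantRecord(
  base: ModelRecordLike,
  weightsUrl: string,
  modelId: string,
): ModelRecordLike {
  return {
    model: weightsUrl,
    model_id: modelId,
    model_lib: base.model_lib, // same architecture, so the library is shared
  };
}
```

This only makes sense when the variant has the same architecture, quantization, and metadata as the base model; otherwise a separate library has to be compiled.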

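Now, running the get-started example will use the WizardMath model you just added. See get-started's README for how to run it.

Bring Your Own Model Library

A model library is specified by:

The model architecture (e.g. llama-3, gpt-neox, phi-3)

Quantization (e.g. q4f16_1, q0f32)

Metadata (e.g. context_window_size, sliding_window_size, prefill-chunk-size), which affects memory planning (currently only prefill-chunk-size affects the compiled model)

Platform (e.g. cuda, webgpu, iOS)

In cases where the model you want to run is not compatible with the provided MLC prebuilt model libraries (e.g. it has a different quantization, a different metadata spec, or even a different model architecture), you need to build your own model library.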
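
The compatibility rule above can be written down as a predicate over the four parts of a library specification. The sketch below is an illustration of that rule, not an MLC or WebLLM API: the LibrarySpec type and canReuseLibrary helper are hypothetical, and of the metadata only the prefill chunk size is compared, following the statement that it is currently the only metadata that affects the compiled model.

```typescript
// Hypothetical description of what a compiled model library was built for.
interface LibrarySpec {
  architecture: string;     // e.g. "llama-3", "gpt-neox"
  quantization: string;     // e.g. "q4f16_1", "q0f32"
  platform: string;         // e.g. "webgpu", "cuda"
  prefillChunkSize: number; // the one metadata field compared in this sketch
}

// Returns true when an available library matches everything the wanted
// model needs, so the library can be reused instead of compiled again.
function canReuseLibrary(wanted: LibrarySpec, available: LibrarySpec): boolean {
  return (
    wanted.architecture === available.architecture &&
    wanted.quantization === available.quantization &&
    wanted.platform === available.platform &&
    wanted.prefillChunkSize === available.prefillChunkSize
  );
}
```

A false result corresponds to the case handled in the rest of this section, where a new library has to be compiled.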

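In this section, we walk you through adding RedPajama-INCITE-Chat-3B-v1 to the get-started example.

This section is largely replicated from Compile Model Libraries. See that page for more details, specifically the WebGPU option.

Step 0. Install dependencies

To compile model libraries for webgpu, you need to build mlc_llm from source. You also need to follow Install Wasm Build Environment. Otherwise, you will run into the following error: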

RuntimeError: Cannot find libraries: wasm_runtime.bc

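Step 1. Clone from HF and convert_weight

You can work under the mlc-llm repo or in your own working directory. Note that all platforms can share the same compiled/quantized weights.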

# Create directory
mkdir -p dist/models && cd dist/models
# Clone HF weights
git lfs install
git clone https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1
cd ../..
# Convert weight
mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC

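Step 2. Generate mlc-chat-config and compile

As described at the start of this section, a model library is specified by its model architecture (e.g. llama-2, gpt-neox), quantization (e.g. q4f16_1, q0f32), metadata (e.g. context_window_size, sliding_window_size, prefill-chunk-size), which affects memory planning, and platform (e.g. cuda, webgpu, iOS).

All these options are specified in the mlc-chat-config.json generated by gen_config.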

# 1. gen_config: generate mlc-chat-config.json and process tokenizers
mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
    --quantization q4f16_1 --conv-template redpajama_chat \
    -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
# 2. compile: compile model library with specification in mlc-chat-config.json
mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
    --device webgpu -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm

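Note

When compiling larger models such as Llama-3-8B, you may want to add --prefill_chunk_size 1024 to reduce memory usage. Otherwise, you may run into an issue like the following at runtime: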

TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

Step 3. Distribute model library and model weights

After following the steps above, you should end up with:

~/mlc-llm > ls dist/libs
  RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm  # ===> the model library

~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
  mlc-chat-config.json                             # ===> the chat config
  ndarray-cache.json                               # ===> the model weight info
  params_shard_0.bin                               # ===> the model weights
  params_shard_1.bin
  ...
  tokenizer.json                                   # ===> the tokenizer files
  tokenizer_config.json
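
Before uploading, it can help to confirm that the weight directory contains the pieces shown in the listing above. The sketch below takes a list of file names and reports what appears to be missing. The helper name missingWeightFiles is hypothetical, and the set of required entries (the chat config, the weight info, and at least one params_shard_N.bin) is taken from this listing only; tokenizer files are not checked because their names vary by model.

```typescript
// Hypothetical check over the file names found in a converted weight
// directory. Returns the entries that appear to be missing.
function missingWeightFiles(files: string[]): string[] {
  const missing: string[] = [];

  // Fixed-name files shown in the directory listing.
  for (const name of ["mlc-chat-config.json", "ndarray-cache.json"]) {
    if (files.indexOf(name) === -1) {
      missing.push(name);
    }
  }

  // At least one weight shard such as params_shard_0.bin must be present.
  const hasShard = files.some((f) => /^params_shard_\d+\.bin$/.test(f));
  if (!hasShard) {
    missing.push("params_shard_*.bin");
  }
  return missing;
}
```

The file names could come from any directory listing, for example the output of ls split into lines; an empty result means the three kinds of entries were all found.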

Upload RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm to a GitHub repository (for us, it is in binary-mlc-llm-libs). Then upload RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC to a Hugging Face repo:

# First, please create a repository on Hugging Face.
# With the repository created, run
git lfs install
git clone https://huggingface.co/my-huggingface-account/my-redpajama3b-weight-huggingface-repo
cd my-redpajama3b-weight-huggingface-repo
cp path/to/mlc-llm/dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/* .
git add . && git commit -m "Add redpajama-3b chat model weights"
git push origin main

This would result in something like RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC.

Step 4. Register as a ModelRecord

Finally, we are able to run the model we added in WebLLM's get-started:

const appConfig: webllm.AppConfig = {
  model_list: [
    // Other records here omitted...
    {
      // Weights uploaded to Hugging Face in Step 3
      "model": "https://huggingface.co/my-huggingface-account/my-redpajama3b-weight-huggingface-repo/resolve/main/",
      "model_id": "RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
      // Model library compiled in Step 2 and uploaded to GitHub in Step 3
      "model_lib": "https://raw.githubusercontent.com/my-gh-account/my-repo/main/RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm",
      "required_features": ["shader-f16"],
    },
  ],
};

// Must match the model_id registered in appConfig above.
const selectedModel = "RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC";
const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
  selectedModel,
  { appConfig: appConfig },
);
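
The record above lists shader-f16 under required_features, so it is worth knowing ahead of time whether the user's GPU adapter offers that feature. The sketch below compares a list of required feature names against anything that has a has(name) method, which is how the standard WebGPU GPUAdapter.features set behaves. The helper name missingFeatures is hypothetical and not part of WebLLM.

```typescript
// Hypothetical helper: returns the required feature names that the given
// feature set does not contain. `supported` can be any object with a
// `has` method, such as the `features` set of a WebGPU adapter.
function missingFeatures(
  required: string[],
  supported: { has(name: string): boolean },
): string[] {
  return required.filter((feature) => !supported.has(feature));
}
```

In a browser, the second argument would typically be adapter.features from an adapter obtained through navigator.gpu.requestAdapter(); a non-empty result suggests choosing a model variant that does not need the missing features.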

Now, running the get-started example will use the RedPajama model you just added. See get-started's README for how to run it.
