Python API

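Note

This page introduces the Python API of MLCEngine in MLC LLM.

MLC LLM provides a Python API through the classes mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine, which are fully compatible with the OpenAI API and easy to integrate into other Python projects.

This page explains how to use the engines in MLC LLM. The Python API is part of the MLC-LLM package, for which prebuilt pip packages are available through the installation page.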

Verify Installation

python -c "from mlc_llm import MLCEngine; print(MLCEngine)"

You should see the output <class 'mlc_llm.serve.engine.MLCEngine'>.

If the command above results in an error, follow Install MLC LLM Python Package to install the prebuilt pip package or build MLC LLM from source.
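The same check can be done from inside a Python program without triggering the import itself. The snippet below is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of mlc_llm.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level `package` can be located for import.

    Hypothetical helper, not part of mlc_llm.
    """
    return importlib.util.find_spec(package) is not None
```

Calling `is_installed("mlc_llm")` before constructing an engine lets a program print an installation hint instead of failing with an ImportError.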

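Run MLCEngine

mlc_llm.MLCEngine provides a synchronous OpenAI chat completion interface. Because of its synchronous design, mlc_llm.MLCEngine does not batch concurrent requests; use AsyncMLCEngine for request batching.

Stream Response. The Quick Start and Introduction to MLC LLM pages introduce the basic use of mlc_llm.MLCEngine.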

from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

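This code example first creates an mlc_llm.MLCEngine instance with the 8B Llama-3 model. The Python API is designed to align with the OpenAI API, which means you can use it in the same way as OpenAI's Python package: mlc_llm.MLCEngine for synchronous generation and mlc_llm.AsyncMLCEngine for asynchronous generation.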
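When the full reply is needed as one string rather than printed piece by piece, the streamed chunks can be accumulated. The helper below is a minimal sketch: `collect_stream` is a hypothetical name, not part of mlc_llm. It assumes each chunk exposes `choices[i].delta.content` as in the example above, that `content` may be `None` or empty on some chunks, and that a single choice is requested so the deltas belong to one reply.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas of a streamed chat completion.

    Hypothetical helper, not part of mlc_llm.
    """
    pieces = []
    for chunk in chunks:
        for choice in chunk.choices:
            content = choice.delta.content
            if content:  # skip None or empty deltas
                pieces.append(content)
    return "".join(pieces)
```

With the engine from the example above, passing the result of `engine.chat.completions.create(..., stream=True)` to `collect_stream` would return the whole reply as a string.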

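Non-stream Response. The code example above uses the synchronous chat completion interface and iterates over all the streamed responses. If you want to run without streaming, you can run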

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)

Please refer to OpenAI's Python package and the OpenAI chat completion API for the complete chat completion interface.
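Because the interface follows the OpenAI chat format, a multi-turn conversation can be kept by appending each reply to the `messages` list before the next call. The sketch below assumes the non-streaming response exposes the reply text at `response.choices[0].message.content`, as the OpenAI chat completion response does. `chat_turn` is a hypothetical helper, and `engine` may be any object with the `chat.completions.create` method shown above.

```python
from typing import Any, Dict, List


def chat_turn(engine: Any, model: str, messages: List[Dict[str, str]], user_text: str) -> str:
    """Send one user message and record the assistant reply in `messages`.

    Hypothetical helper, not part of mlc_llm.
    """
    messages.append({"role": "user", "content": user_text})
    response = engine.chat.completions.create(
        messages=messages,
        model=model,
        stream=False,
    )
    # Assumed OpenAI-style response layout.
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Starting from `history = []`, repeated calls such as `chat_turn(engine, model, history, "...")` would grow `history` by one user and one assistant message per turn.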

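Note

If you want to enable tensor parallelism to run an LLM on multiple GPUs, pass an EngineConfig with tensor_parallel_shards set through the engine_config argument of the MLCEngine constructor. For example,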

from mlc_llm import MLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)

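Run AsyncMLCEngine

mlc_llm.AsyncMLCEngine provides an OpenAI chat completion interface with asynchronous features. We recommend using mlc_llm.AsyncMLCEngine to batch concurrent requests for better throughput.

Stream Response. The core use of mlc_llm.AsyncMLCEngine for stream responses is as follows.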

async for response in await engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

The following is a complete runnable example of AsyncMLCEngine in Python.

import asyncio
from typing import Dict

from mlc_llm.serve import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
prompts = [
    "Write a three-day travel plan to Pittsburgh.",
    "What is the meaning of life?",
]


async def test_completion():
    # Create engine
    async_engine = AsyncMLCEngine(model=model)

    num_requests = len(prompts)
    output_texts: Dict[str, str] = {}

    async def generate_task(prompt: str):
        async for response in await async_engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            stream=True,
        ):
            if response.id not in output_texts:
                output_texts[response.id] = ""
            output_texts[response.id] += response.choices[0].delta.content

    tasks = [asyncio.create_task(generate_task(prompts[i])) for i in range(num_requests)]
    await asyncio.gather(*tasks)

    # Print output.
    for request_id, output in output_texts.items():
        print(f"Output of request {request_id}:\n{output}\n")

    async_engine.terminate()


asyncio.run(test_completion())

Non-stream Response. Similarly, mlc_llm.AsyncMLCEngine provides a non-stream response interface.

response = await engine.chat.completions.create(
  messages=[{"role": "user", "content": "What is the meaning of life?"}],
  model=model,
  stream=False,
)
print(response)

Please refer to OpenAI's Python package and the OpenAI chat completion API for the complete chat completion interface.
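To let the engine batch several non-stream requests, the requests need to be in flight at the same time. The sketch below awaits one request per prompt with `asyncio.gather`. `complete_all` is a hypothetical helper, `engine` is assumed to expose the awaitable `chat.completions.create` shown above, and the reply text is assumed to be at `response.choices[0].message.content` as in the OpenAI response format.

```python
import asyncio
from typing import Any, List


async def complete_all(engine: Any, model: str, prompts: List[str]) -> List[str]:
    """Issue one non-stream request per prompt and await them together.

    Hypothetical helper, not part of mlc_llm.
    """

    async def one(prompt: str) -> str:
        response = await engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=model,
            stream=False,
        )
        return response.choices[0].message.content

    # gather() returns the results in the same order as `prompts`.
    return await asyncio.gather(*(one(p) for p in prompts))
```

Inside a coroutine, `replies = await complete_all(engine, model, prompts)` would return one reply per prompt, in prompt order.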

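Note

If you want to enable tensor parallelism to run an LLM on multiple GPUs, pass an EngineConfig with tensor_parallel_shards set through the engine_config argument of the AsyncMLCEngine constructor. For example,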

from mlc_llm import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = AsyncMLCEngine(
    model,
    engine_config=EngineConfig(tensor_parallel_shards=2),
)

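Engine Mode

To simplify engine configuration, the constructors of mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine take an optional argument mode, which is one of "local", "interactive" or "server". The default mode is "local".

Each mode denotes a predefined engine configuration for a different use case. The choice of mode controls the engine's request concurrency and its KV cache token capacity (the maximum number of tokens the KV cache can hold), which in turn affects the engine's GPU memory usage.

In short,

Mode "local" uses low request concurrency and low KV cache capacity, and suits cases where concurrent requests are few and the user wants to save GPU memory.
Mode "interactive" uses a request concurrency of 1 and low KV cache capacity, and is designed for interactive use cases such as chat and conversation.
Mode "server" uses as much request concurrency and KV cache capacity as possible. It aims to make full use of GPU memory and suits large server scenarios where concurrent requests may be many.

For system benchmarking, select mode "server". See the API reference for detailed documentation of the engine modes.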
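The mode is passed as a keyword argument when the engine is constructed. The sketch below wraps that in a small factory that rejects unknown mode names early. `create_engine`, `VALID_MODES` and the `use_async` flag are hypothetical names introduced for illustration; only the `mode` argument and its three values come from the text above.

```python
from typing import Any

# The three mode names described above.
VALID_MODES = ("local", "interactive", "server")


def create_engine(model: str, mode: str = "local", use_async: bool = False) -> Any:
    """Build an MLCEngine or AsyncMLCEngine in the given mode.

    Hypothetical helper, not part of mlc_llm.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # Imported lazily so the validation above works without mlc_llm installed.
    from mlc_llm import AsyncMLCEngine, MLCEngine

    engine_cls = AsyncMLCEngine if use_async else MLCEngine
    return engine_cls(model, mode=mode)
```

For a benchmark run, `create_engine(model, mode="server", use_async=True)` would follow the recommendation above.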

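Deploy Your Own Model with Python API

The introduction page describes how to deploy your own models with MLC LLM. This section explains how to use your converted model weights and built model library in mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine.

We use Phi-2 as the example model.

Specify the model weight path. Assuming you have converted the model weights for your own model, you can construct an mlc_llm.MLCEngine as follows: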

from mlc_llm import MLCEngine

model = "models/phi-2"  # Assuming the converted phi-2 model weights are under "models/phi-2"
engine = MLCEngine(model)

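Specify the model library path. In addition, if you have built the model library yourself, you can use it in mlc_llm.MLCEngine by passing its path through the argument model_lib.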

from mlc_llm import MLCEngine

model = "models/phi-2"
model_lib = "models/phi-2/lib.so"  # Assuming the phi-2 model library is built at "models/phi-2/lib.so"
engine = MLCEngine(model, model_lib=model_lib)

The same applies to mlc_llm.AsyncMLCEngine.
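A mistyped local path is easier to diagnose before the engine is constructed. The sketch below checks that the weight directory and the optional library file exist, then returns keyword arguments for either engine class. `local_model_args` is a hypothetical helper; the argument names `model` and `model_lib` are the ones used in the examples above.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def local_model_args(model_dir: str, model_lib: Optional[str] = None) -> Dict[str, Any]:
    """Validate local paths and return keyword arguments for the engine.

    Hypothetical helper, not part of mlc_llm.
    """
    if not Path(model_dir).is_dir():
        raise FileNotFoundError(f"model weight directory not found: {model_dir}")
    args: Dict[str, Any] = {"model": model_dir}
    if model_lib is not None:
        if not Path(model_lib).is_file():
            raise FileNotFoundError(f"model library not found: {model_lib}")
        args["model_lib"] = model_lib
    return args
```

With the Phi-2 layout assumed above, `MLCEngine(**local_model_args("models/phi-2", "models/phi-2/lib.so"))` would construct the engine only after both paths have been found.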

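API Reference

The mlc_llm.MLCEngine and mlc_llm.AsyncMLCEngine classes provide the following constructors.

MLCEngine and AsyncMLCEngine are fully compatible with the OpenAI API. Please refer to OpenAI's Python package and the OpenAI chat completion API for the complete chat completion interface.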
