JSON Generation¶
XGrammar enables efficient structured generation, with JSON and JSON Schema as representative structures. In this tutorial, we explore how to use XGrammar to ensure that an LLM's output is valid JSON or conforms to a custom JSON schema.
We first look at how to achieve this inside an LLM engine, covered in JSON Generation in the LLM Engine, and then provide an end-to-end example of JSON generation with XGrammar and HF Transformers in Try It Out with HF Transformers.
Installing XGrammar¶
XGrammar can be installed with pip. We recommend always installing it in an isolated conda virtual environment, for example as shown below.
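A minimal setup might look like the following (the environment name and Python version here are arbitrary choices; the package is published on PyPI as xgrammar):

conda create -n xgrammar python=3.11 -y
conda activate xgrammar
pip install xgrammar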
JSON Generation in the LLM Engine¶
In this section, we will see how to use XGrammar in an LLM engine to ensure that the engine's output is always valid JSON.
All of the code snippets below are actually runnable, since we simulate the LLM generation process.
First, import the libraries needed for this tutorial.
import xgrammar as xgr
import torch
import numpy as np
from transformers import AutoTokenizer, AutoConfig
Next, extract the tokenizer information from the LLM you are using with xgr.TokenizerInfo. With tokenizer_info in hand, instantiate an xgr.GrammarCompiler, which will compile the grammar of your choice.
# Get tokenizer info
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)
# This can be larger than tokenizer.vocab_size due to padding
full_vocab_size = config.vocab_size
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=full_vocab_size)
compiler = xgr.GrammarCompiler(tokenizer_info, max_threads=8)
For JSON generation, there are generally three ways to compile the grammar: use the built-in JSON grammar, specify a JSON schema via a Pydantic model, or compile from a JSON schema string. Pick one of the three options below to run.
# Option 1: Compile with a built-in JSON grammar
compiled_grammar: xgr.CompiledGrammar = compiler.compile_builtin_json_grammar()
# Option 2: Compile with JSON schema from a pydantic model
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int
compiled_grammar = compiler.compile_json_schema(Person)
# Option 3: Compile with JSON schema from a JSON schema string
import json
person_schema = {
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "age": {
            "type": "integer"
        }
    },
    "required": ["name", "age"]
}
compiled_grammar = compiler.compile_json_schema(json.dumps(person_schema))
With the compiled grammar, we can instantiate an xgr.GrammarMatcher, the main construct we interact with, which maintains the state of the structured generation. We also allocate a token bitmask that will be used to mask the logits.
# Instantiate grammar matcher and allocate the bitmask
matcher = xgr.GrammarMatcher(compiled_grammar)
token_bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
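As a quick sanity check, the mask packs one bit per vocabulary token into 32-bit integers, so its shape is (batch_size, ceil(vocab_size / 32)). The print below is an illustrative addition, not part of the original walkthrough:

# One bit per vocabulary token, packed into int32 words
print(token_bitmask.shape, token_bitmask.dtype)
# e.g. torch.Size([1, 4008]) torch.int32 for a 128256-token vocabulary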
Now we simulate a single-request auto-regressive generation process; a batched sketch follows the loop below. For batched inference, see Integration with the LLM Engine.
# Here we simulate a valid sampled response
sim_sampled_response = '{"name": "xgrammar", "age": 0}<|end_of_text|>'
sim_sampled_token_ids = tokenizer.encode(sim_sampled_response, add_special_tokens=False)
# Each loop iteration is a simulated auto-regressive step
for i, sim_token_id in enumerate(sim_sampled_token_ids):
    # LLM inference to get logits, here we use randn to simulate.
    # logits is a tensor of shape (full_vocab_size,) on GPU
    # logits = LLM.inference()
    logits = torch.randn(full_vocab_size).cuda()

    # Apply bitmask to logits to mask invalid tokens
    matcher.fill_next_token_bitmask(token_bitmask)
    xgr.apply_token_bitmask_inplace(logits, token_bitmask.to(logits.device))

    # Sample next token
    probs = torch.softmax(logits, dim=-1).cpu().numpy()
    next_token_id = np.random.choice(list(range(full_vocab_size)), p=probs)

    # Accept token from matcher to update its state, so that the next bitmask
    # generated will enforce the next token to be generated. Assert to make
    # sure the token is indeed valid. Here we accept the simulated response
    # assert matcher.accept_token(next_token_id)
    assert matcher.accept_token(sim_token_id)

# Since we accepted a stop token `<|end_of_text|>`, we have terminated
assert matcher.is_terminated()

# Reset to be ready for the next auto-regressive generation
matcher.reset()
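The same primitives extend to batched decoding: keep one matcher per request and allocate a bitmask with one row per request. The sketch below is a minimal illustration under those assumptions (the two-request batch and the random logits are illustrative only):

# Minimal batched sketch: one matcher per request, one bitmask row per request
batch_size = 2
matchers = [xgr.GrammarMatcher(compiled_grammar) for _ in range(batch_size)]
batch_bitmask = xgr.allocate_token_bitmask(batch_size, tokenizer_info.vocab_size)

# One simulated decoding step for the whole batch
batch_logits = torch.randn(batch_size, full_vocab_size).cuda()
for i, m in enumerate(matchers):
    # Fill row i of the bitmask from matcher i's current state
    m.fill_next_token_bitmask(batch_bitmask, i)
xgr.apply_token_bitmask_inplace(batch_logits, batch_bitmask.to(batch_logits.device))
# ...then sample one token per request and call matchers[i].accept_token(...) as above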
Try It Out with HF Transformers¶
XGrammar integrates easily with Hugging Face Transformers via a LogitsProcessor. Note that this integration is mainly meant for accessibility and may incur extra overhead.
First, instantiate a model, a tokenizer, and the inputs.
import xgrammar as xgr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
device = "cuda" # Or "cpu", etc.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32, device_map=device
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce yourself in JSON with two fields: name and age."},
]
texts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer(texts, return_tensors="pt").to(model.device)
Then construct a GrammarCompiler and compile the grammar.
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=config.vocab_size)
grammar_compiler = xgr.GrammarCompiler(tokenizer_info)
# Option 1: Compile with a built-in JSON grammar
# compiled_grammar = grammar_compiler.compile_builtin_json_grammar()
# Option 2: Compile with JSON schema from a pydantic model
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int
compiled_grammar = grammar_compiler.compile_json_schema(Person)
Finally, use the LogitsProcessor to generate with the grammar.
xgr_logits_processor = xgr.contrib.hf.LogitsProcessor(compiled_grammar)
generated_ids = model.generate(
    **model_inputs, max_new_tokens=512, logits_processor=[xgr_logits_processor]
)
generated_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
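Since generation was constrained by the schema compiled from Person, the decoded text should parse back into that model. A quick check along these lines (an illustrative addition, assuming Pydantic v2) confirms it:

# Validate the constrained output against the same Pydantic model (Pydantic v2 API)
output_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
person = Person.model_validate_json(output_text)
print(person.name, person.age)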