EBNF-Guided Generation
XGrammar supports efficient structured generation. Beyond JSON, you can also use an EBNF grammar to guide the generation process, giving you much more flexibility for customization.
We first review how to use XGrammar in an LLM engine to achieve this in EBNF-Guided Generation with an LLM Engine, then provide an end-to-end example of JSON generation with HF's transformers library in Try Out via HF Transformers.
Install XGrammar
XGrammar is available via pip. We recommend always installing it in an isolated conda virtual environment.
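A minimal setup might look like the following; the environment name and Python version here are arbitrary choices, and xgrammar is the package name published on PyPI.
conda create -n xgrammar python=3.11 -y
conda activate xgrammar
pip install xgrammar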
EBNF-Guided Generation with an LLM Engine
In this section, we walk through how to use XGrammar in an LLM engine so that the output is guaranteed to follow an EBNF grammar.
All code snippets below are runnable as-is, since the LLM generation step is simulated.
First, import the libraries needed for this tutorial.
import xgrammar as xgr
import torch
import numpy as np
from transformers import AutoTokenizer, AutoConfig
Next, extract the tokenizer information from the LLM you are using with xgr.TokenizerInfo. With the tokenizer_info in hand, instantiate an xgr.GrammarCompiler, which compiles the grammar of your choice.
# Get tokenizer info
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)
# This can be larger than tokenizer.vocab_size due to paddings
full_vocab_size = config.vocab_size
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=full_vocab_size)
compiler = xgr.GrammarCompiler(tokenizer_info, max_threads=8)
Then, specify the EBNF grammar string. The grammar below, for instance, accepts sequences of arithmetic identities such as (5+3)*2=16. We currently use the GBNF (GGML BNF) format; see its specification here.
ebnf_grammar_str = """root ::= (expr "=" term)+
expr ::= term ([-+*/] term)*
term ::= num | "(" expr ")"
num ::= [0-9]+"""
compiled_grammar = compiler.compile_grammar(ebnf_grammar_str)
With the compiled grammar, instantiate an xgr.GrammarMatcher, the main construct we interact with: it maintains the state of the structured generation. We also allocate a token bitmask, which will be used to mask the logits.
# Instantiate grammar matcher and allocate the bitmask
matcher = xgr.GrammarMatcher(compiled_grammar)
token_bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
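As a point of reference (an assumption worth verifying against your installed version): the bitmask packs one bit per token id into 32-bit integers, so a quick sanity check on its shape and dtype looks like this.
import math
# One bit per token id, packed into int32 words, so the expected shape is
# (batch_size, ceil(vocab_size / 32)). Verify before reusing the buffer.
assert token_bitmask.shape == (1, math.ceil(tokenizer_info.vocab_size / 32))
assert token_bitmask.dtype == torch.int32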
Now, simulate a single-request auto-regressive generation. For batched inference, see Integration with LLM Engines.
# Here we simulate a valid sampled response
sim_sampled_response = '(5+3)*2=16<|end_of_text|>'
sim_sampled_token_ids = tokenizer.encode(sim_sampled_response, add_special_tokens=False)
# Each loop iteration is a simulated auto-regressive step
for i, sim_token_id in enumerate(sim_sampled_token_ids):
    # LLM inference to get logits, here we use randn to simulate.
    # logits is a tensor of shape (full_vocab_size,) on GPU
    # logits = LLM.inference()
    logits = torch.randn(full_vocab_size).cuda()

    # Apply bitmask to logits to mask invalid tokens
    matcher.fill_next_token_bitmask(token_bitmask)
    xgr.apply_token_bitmask_inplace(logits, token_bitmask.to(logits.device))

    # Sample next token
    probs = torch.softmax(logits, dim=-1).cpu().numpy()
    next_token_id = np.random.choice(list(range(full_vocab_size)), p=probs)

    # Accept token from matcher to update its state, so that the next bitmask
    # generated will enforce the next token to be generated. Assert to make
    # sure the token is indeed valid. Here we accept the simulated response
    # assert matcher.accept_token(next_token_id)
    assert matcher.accept_token(sim_token_id)
# Since we accepted a stop token `<|end_of_text|>`, we have terminated
assert matcher.is_terminated()
# Reset to be ready for the next auto-regressive generation
matcher.reset()
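The same primitives extend to batched inference: each request keeps its own matcher, while a single bitmask tensor holds one row per request. A minimal sketch, assuming fill_next_token_bitmask accepts a row index as in recent XGrammar releases:
# Hypothetical two-request batch; each request has its own matcher state.
batch_size = 2
matchers = [xgr.GrammarMatcher(compiled_grammar) for _ in range(batch_size)]
batch_bitmask = xgr.allocate_token_bitmask(batch_size, tokenizer_info.vocab_size)
batch_logits = torch.randn(batch_size, full_vocab_size).cuda()  # simulated logits
for i, m in enumerate(matchers):
    m.fill_next_token_bitmask(batch_bitmask, index=i)  # fill row i for request i
xgr.apply_token_bitmask_inplace(batch_logits, batch_bitmask.to(batch_logits.device))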
Try Out via HF Transformers
XGrammar easily integrates with Hugging Face Transformers via a LogitsProcessor. Note that this integration primarily aims for accessibility; it may incur extra overhead.
First, instantiate a model, a tokenizer, and the inputs.
import xgrammar as xgr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
device = "cuda" # Or "cpu", etc.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32, device_map=device
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce yourself in JSON briefly."},
]
texts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer(texts, return_tensors="pt").to(model.device)
Then, construct a GrammarCompiler and compile the grammar.
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=config.vocab_size)
grammar_compiler = xgr.GrammarCompiler(tokenizer_info)
# Grammar string that represents a JSON schema
json_grammar_ebnf_str = r"""
root ::= basic_array | basic_object
basic_any ::= basic_number | basic_string | basic_boolean | basic_null | basic_array | basic_object
basic_integer ::= ("0" | "-"? [1-9] [0-9]*) ".0"?
basic_number ::= ("0" | "-"? [1-9] [0-9]*) ("." [0-9]+)? ([eE] [+-]? [0-9]+)?
basic_string ::= (([\"] basic_string_1 [\"]))
basic_string_1 ::= "" | [^"\\\x00-\x1F] basic_string_1 | "\\" escape basic_string_1
escape ::= ["\\/bfnrt] | "u" [A-Fa-f0-9] [A-Fa-f0-9] [A-Fa-f0-9] [A-Fa-f0-9]
basic_boolean ::= "true" | "false"
basic_null ::= "null"
basic_array ::= "[" ("" | ws basic_any (ws "," ws basic_any)*) ws "]"
basic_object ::= "{" ("" | ws basic_string ws ":" ws basic_any ( ws "," ws basic_string ws ":" ws basic_any)*) ws "}"
ws ::= [ \n\t]*
"""
compiled_grammar = grammar_compiler.compile_grammar(json_grammar_ebnf_str)
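If you only need generic JSON rather than a custom schema, XGrammar also provides a built-in JSON grammar, which serves as a drop-in alternative to the hand-written grammar string above:
# Built-in grammar for arbitrary JSON, equivalent in spirit to the
# hand-written EBNF string above.
compiled_grammar = grammar_compiler.compile_builtin_json_grammar()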
Finally, use the LogitsProcessor to generate with the grammar.
xgr_logits_processor = xgr.contrib.hf.LogitsProcessor(compiled_grammar)
generated_ids = model.generate(
    **model_inputs, max_new_tokens=512, logits_processor=[xgr_logits_processor]
)
generated_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
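Because the grammar only admits well-formed JSON, the decoded output should parse cleanly, provided generation terminated within the token budget. A quick check:
import json
output = tokenizer.decode(generated_ids, skip_special_tokens=True)
# Should not raise, assuming the model finished within max_new_tokens;
# a generation truncated by the budget could still end mid-document.
print(json.loads(output))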