June 15, 2023 00:04:46

How does `enforce_stop_tokens` work in LangChain with Huggingface models?

Question

Looking at how LangChain uses Hugging Face models, there is a part where the author says they can't figure out a better way to stop the generation (https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py#L182):

class HuggingFacePipeline(LLM):
        ...
    def _call(
        ...
        if stop is not None:
            # This is a bit hacky, but I can't figure out a better way to enforce
            # stop tokens when making calls to huggingface_hub.
            text = enforce_stop_tokens(text, stop)
        return text

What should I use to add the stop token to the end of the template?

If we look at https://github.com/hwchase17/langchain/blob/master/langchain/llms/utils.py, `enforce_stop_tokens` is simply a regex split: it splits the input string on any of the stop words in the list, then takes the first partition of the re.split result:

re.split("|".join(stop), text)[0]
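
To make that behaviour concrete, here is a minimal standalone sketch of the same split-and-take-first logic. The helper `enforce_stop_tokens_escaped` and the sample strings are illustrative assumptions, not part of LangChain. The escaped variant is included only because the one-liner joins the stop strings into a regular expression, so characters such as "." are treated as regex syntax rather than literal text.

```python
import re

def enforce_stop_tokens(text, stop):
    """Same logic as the one-liner above: split on any stop string, keep the first piece."""
    return re.split("|".join(stop), text)[0]

def enforce_stop_tokens_escaped(text, stop):
    """Hypothetical variant: escape each stop string so it is matched literally."""
    pattern = "|".join(re.escape(s) for s in stop)
    return re.split(pattern, text)[0]

sample = "Answer: 42.\n\nQuestion: what next?"

# The text is cut at the earliest stop string; the stop string itself is dropped.
cut = enforce_stop_tokens(sample, ["\n\nQuestion:", "\n\nAnswer:"])

# "." is a regex wildcard, so the unescaped version matches at the very first character.
unescaped = enforce_stop_tokens("Price 3.50 USD. Done", ["."])
escaped = enforce_stop_tokens_escaped("Price 3.50 USD. Done", ["."])

print(repr(cut))
print(repr(unescaped))
print(repr(escaped))
```

Note that the function works purely on the finished string: the model has already generated the full text, and the cut happens afterwards.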

Let's try to get a generation output from a Hugging Face model, e.g.:

from transformers import pipeline
from transformers import GPT2LMHeadModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
output = generator("Hey Pizza! ")
output

[out]:

[{'generated_text': 'Hey Pizza! 」\n\n「Hurry up, leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and then, Yuigahama came in contact with Ruriko in the middle of the'}]

If we apply the re.split:

import re
def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]
stop = ["up", "then"]
text = output[0]['generated_text']
re.split("|".join(stop), text)

[out]:

['Hey Pizza! 」\n\n「Hurry ',
 ', leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and ',
 ', Yuigahama came in contact with Ruriko in the middle of the']

But that isn't useful; I want to split at the point where the generation ends. What tokens should I pass to `enforce_stop_tokens`?
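
One possible reading of "the point where the generation ends" is the place where the model starts a new turn of the prompt template. The sketch below assumes a hypothetical question/answer template and a made-up continuation (neither comes from LangChain or from the GPT-2 output above) and uses the marker that opens the next turn as the stop string.

```python
import re

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop string occurs."""
    return re.split("|".join(stop), text)[0]

# Hypothetical template: each turn starts with "Q:" and the answer follows "A:".
template = "Q: {question}\nA:"
prompt = template.format(question="What is the capital of France?")

# Made-up continuation: the model answers, then starts inventing a new turn.
continuation = " Paris.\nQ: What is the capital of Spain?\nA: Madrid."

# The marker that opens the next turn serves as the stop string for this template.
stop = ["\nQ:"]
answer = enforce_stop_tokens(continuation, stop).strip()

print(answer)
```

With ordinary words such as "up" or "then" as stop strings, the cut lands wherever those words happen to occur; a template marker ties the cut to the structure of the prompt.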

Answer 1

Score: 1

You could do this by setting eos_token_id to your stop term(s); in my testing it seemed to work with a list. See below: the regex cuts the text off before the stop word, while eos_token_id cuts it off just after the stop word ("Once upon a time" vs. "Once upon a").

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import regex as re
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Define your custom stop terms
stop_terms = ["right", "time"]
# Ensure the stop terms are in the tokenizer's vocabulary
for term in stop_terms:
    if term not in tokenizer.get_vocab():
        tokenizer.add_tokens([term])
        model.resize_token_embeddings(len(tokenizer))
def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]
# Get the token IDs for your custom stop terms
# (only the first token of each term, encoded with a leading space)
eos_token_ids_custom = [tokenizer.encode(term, add_prefix_space=True)[0] for term in stop_terms]
# Generate text
input_text = "Once upon "
input_ids = tokenizer.encode(input_text, return_tensors='pt')
output_ids = model.generate(input_ids, eos_token_id=eos_token_ids_custom, max_length=50)
# Decode the output IDs to text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text) # Once upon a time
print("ENFORCE STOP TOKENS")
truncated_text = enforce_stop_tokens(generated_text, stop_terms)
print(truncated_text) # Once upon a
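
The difference between the two cut points can be illustrated without loading a model. The following is a toy sketch, not the transformers decoding loop: `generate_until_eos` is a hypothetical stand-in that emits made-up tokens until it hits one marked as end-of-sequence, and it assumes (as the output above suggests) that the end-of-sequence token is itself emitted before generation stops.

```python
import re

def generate_until_eos(tokens, eos_tokens):
    """Toy stand-in for a decoding loop: emit tokens until one is an EOS token.

    The EOS token is appended before stopping, which is why the stop word
    remains in the generated text.
    """
    out = []
    for tok in tokens:
        out.append(tok)
        if tok in eos_tokens:
            break
    return out

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

# Made-up token stream standing in for what a model might produce.
stream = ["Once", " upon", " a", " time", ",", " there", " was"]
stop_terms = ["right", "time"]

# Token-level stop: generation ends just after the stop word.
generated = "".join(generate_until_eos(stream, {" right", " time"}))

# Text-level stop: the stop word itself is removed as well.
truncated = enforce_stop_tokens(generated, stop_terms)

print(repr(generated))
print(repr(truncated))
```

The token-level stop saves the compute that would otherwise go into tokens after the stop word, while the text-level cut only trims a string that has already been generated.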
