Deployed dolly2 model in Sagemaker for embeddings, but receiving a 400 error when calling endpoint


# Question

I have deployed the dolly2 model in SageMaker and I am trying to create some vectors for embeddings. The code works just fine for text generation, but after changing inference.py to handle embeddings, I am getting the error below:

```plaintext
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "(\"You need to define one of the following ['audio-classification', 'automatic-speech-recognition', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'conversational', 'image-classification', 'image-segmentation', 'image-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification'] as env 'HF_TASK'.\", 403)"
}
```


Below you can also see the code that I am using for the embeddings:

```python
import json
import os
import boto3
from transformers import pipeline

def invoke_sagemaker_endpoint():
    # Create a SageMaker runtime client
    sagemaker_client = boto3.client("sagemaker-runtime")

    # Define the endpoint name and payload
    endpoint_name = 'XXX'  # Replace with your SageMaker endpoint name
    payload = {"inputs": "This is a large document."}  # Update the payload format as expected by the model

    # Send the request to the SageMaker endpoint
    response = sagemaker_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    # Parse the response and extract the embeddings vector
    response_body = response["Body"].read().decode("utf-8")
    response_json = json.loads(response_body)

    if "embeddings" in response_json:
        embeddings = response_json["embeddings"]
        embeddings_vector = embeddings[0]  # embeddings are returned as a list
        return embeddings_vector
    else:
        return None

if __name__ == "__main__":
    # Set the HF_TASK environment variable to 'feature-extraction' for the embeddings
    os.environ["HF_TASK"] = "feature-extraction"
    # Invoke the SageMaker endpoint
    embeddings_vector = invoke_sagemaker_endpoint()

    if embeddings_vector:
        print(embeddings_vector)
    else:
        print("No embeddings found in the response.")
```

And the inference.py:

```python
import torch
from transformers import pipeline

def model_fn(model_dir):
    model = pipeline(
        "text-generation",
        model=model_dir,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        model_kwargs={"load_in_8bit": True},
    )
    tokenizer = model.tokenizer
    embeddings_model = model.model

    def generate_embeddings(inputs):
        inputs = tokenizer(inputs, truncation=True, padding="longest", return_tensors="pt")
        with torch.no_grad():
            outputs = embeddings_model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()
        return embeddings

    def retrieve_qa(question, context):
        inputs = {"question": question, "context": context}
        qa_outputs = model(question, context)
        return qa_outputs

    return model, generate_embeddings, retrieve_qa
```

What I have tried so far: changing the inference script and the HF settings, redeploying to SageMaker, and calling through API Gateway instead of invoking the SageMaker endpoint directly.




# Answer 1
**Score**: 1

Hey [Arpel](https://stackoverflow.com/users/13038760/arpel), it seems you are mixing two different methods of deploying Hugging Face models as SageMaker endpoints.

For future reference, I see you have tried to set the `HF_TASK` environment variable; however, you're doing it on the instance used to call boto3, which is _separate_ from the instance that will host your model and perform inference. Follow this guide for the specifics on **non-custom** inference with SageMaker and Hugging Face.
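
For the non-custom route, the task is set on the hosting container at deploy time rather than in the calling script. Here is a minimal sketch with the SageMaker Python SDK, assuming deployment straight from the Hugging Face Hub; the model ID, role ARN, container versions, and instance type below are placeholders, not values taken from the question:

```python
from sagemaker.huggingface import HuggingFaceModel

# Environment variables passed here reach the hosting container, so HF_TASK is visible to the toolkit
hub_env = {
    "HF_MODEL_ID": "databricks/dolly-v2-3b",  # assumption: the Hub model you want to serve
    "HF_TASK": "feature-extraction",          # the default handler will build this pipeline
}

huggingface_model = HuggingFaceModel(
    env=hub_env,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # assumption: your execution role
    transformers_version="4.26",  # pick versions that match an available Hugging Face DLC
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumption: size this for the model you deploy
)
```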

Because you'd like the model to perform two tasks, embeddings and QA, you're correct in identifying that you'll need a custom `inference.py` file. To take this approach, you'll need to perform the following steps:

* Clone the model from Hugging Face using git
* Create a `code/` directory (within the model directory) and add an `inference.py` file
* Include two functions in the inference file; these **must** be called `model_fn()` and `predict_fn()`. The former is used only when the endpoint is initialised and must return the model and tokenizer; the latter is called for each inference request. You can use `predict_fn()` to include custom logic.
* Create a tarball (`model.tar.gz`) with all the model artefacts (including your custom inference code). It should be laid out as below.

```plaintext
model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt
```

* Finally, upload the tarball to S3 and pass the S3 URI to SageMaker when creating the model/endpoint (see the sketch of this step below).
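
A minimal sketch of that packaging and upload step, assuming the artefacts sit in a hypothetical local `dolly2-model/` directory laid out as above and reusing the same placeholder role as earlier:

```python
import tarfile

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Bundle the model artefacts, including code/inference.py, into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("dolly2-model/", arcname=".")  # assumption: local directory with the layout shown above

# Upload the tarball to S3 and keep the returned S3 URI
sess = sagemaker.Session()
model_uri = sess.upload_data("model.tar.gz", bucket=sess.default_bucket(), key_prefix="dolly2")

# Point SageMaker at the tarball; the hosting toolkit discovers code/inference.py inside it
huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # assumption: your execution role
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```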

There's a great notebook from Hugging Face covering this whole process; it's the best guide I've been able to find so far. If you copy it word for word and only modify the `inference.py` script, you should be successful.

Here's an example of an `inference.py` I've used previously; as you can see, Hugging Face pipelines work too!

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
from DirectQuoteUtils import reformat
import torch
import os

def model_fn(model_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    pipe = pipeline("ner", model=model, tokenizer=tokenizer)
    return pipe

def predict_fn(data, pipeline):
    pipe = pipeline
    outputs = []

    # FORMAT FOR MODEL INPUT:
    # {               # list of strings
    #     "inputs": ["Donald Trump is the president of the US", "Joe Biden is the United States president"]
    # }

    modelData = pipe(data['inputs'])

    for prediction in modelData:
        cleanPred = reformat(prediction)
        outputs.append(cleanPred)

    return {
        # "device": device, # handy to check if CUDA is being used
        "outputs": outputs
    }
```
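
Carrying the same pattern over to the embeddings use case in the question, a custom `inference.py` could look roughly like the sketch below. This is an illustration, not tested against dolly2; the mean-pooling mirrors the question's `generate_embeddings`, and the payload and response keys match the client script above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def model_fn(model_dir):
    # Load the tokenizer and the bare transformer stack; no generation head is needed for embeddings
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # some causal-LM tokenizers ship without a pad token
    model = AutoModel.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
    model.eval()
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer

    # Expected payload: {"inputs": "some text"} or {"inputs": ["text 1", "text 2"]}
    texts = data["inputs"]
    if isinstance(texts, str):
        texts = [texts]

    encoded = tokenizer(texts, truncation=True, padding="longest", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoded)

    # Mean-pool the last hidden state over the sequence dimension, as in the question's code
    embeddings = outputs.last_hidden_state.float().mean(dim=1).tolist()

    # The client script looks for an "embeddings" key and takes the first element
    return {"embeddings": embeddings}
```

If you also need the QA behaviour, you could branch inside `predict_fn()` on a field in the payload and route to a second pipeline loaded in `model_fn()`.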
