# Deployed dolly2 model in SageMaker for embeddings, but receiving a 400 error when calling endpoint


# Question

I have deployed the dolly2 model in SageMaker and I am trying to create some vectors for embeddings. The code works fine for text generation, but after changing `inference.py` to handle embeddings, I am getting the error below:

```plaintext
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "(\"You need to define one of the following ['audio-classification', 'automatic-speech-recognition', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'conversational', 'image-classification', 'image-segmentation', 'image-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification'] as env 'HF_TASK'.\", 403)"
}
```


Below you can also see the code that I am using for the embeddings:

```python
import json
import os
import boto3
from transformers import pipeline

def invoke_sagemaker_endpoint():
    # Create a SageMaker runtime client
    sagemaker_client = boto3.client("sagemaker-runtime")

    # Define the endpoint name and payload
    endpoint_name = 'XXX'  # Replace with your SageMaker endpoint name
    payload = {"inputs": "This is a large document."}  # Update the payload format as expected by the model

    # Send the request to the SageMaker endpoint
    response = sagemaker_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    # Parse the response and extract the embeddings vector
    response_body = response["Body"].read().decode("utf-8")
    response_json = json.loads(response_body)

    if "embeddings" in response_json:
        embeddings = response_json["embeddings"]
        embeddings_vector = embeddings[0]  # embeddings are returned as a list
        return embeddings_vector
    else:
        return None

if __name__ == "__main__":
    # Set the HF_TASK environment variable to 'feature-extraction' for the embeddings
    os.environ["HF_TASK"] = "feature-extraction"
    # Invoke the SageMaker endpoint
    embeddings_vector = invoke_sagemaker_endpoint()

    if embeddings_vector:
        print(embeddings_vector)
    else:
        print("No embeddings found in the response.")
```

And the `inference.py`:

```python
import torch
from transformers import pipeline

def model_fn(model_dir):
    model = pipeline(
        "text-generation",
        model=model_dir,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        model_kwargs={"load_in_8bit": True},
    )
    tokenizer = model.tokenizer
    embeddings_model = model.model

    def generate_embeddings(inputs):
        inputs = tokenizer(inputs, truncation=True, padding="longest", return_tensors="pt")
        with torch.no_grad():
            outputs = embeddings_model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()
        return embeddings

    def retrieve_qa(question, context):
        inputs = {"question": question, "context": context}
        qa_outputs = model(question, context)
        return qa_outputs

    return model, generate_embeddings, retrieve_qa
```

What I have tried: changed the inference script and the HF (Hugging Face) setup, redeployed to SageMaker, and called it from API Gateway instead of the SageMaker endpoint directly.




# Answer 1
**Score**: 1

Hey [Arpel](https://stackoverflow.com/users/13038760/arpel), it seems you are mixing two different methods of deploying Hugging Face models as SageMaker endpoints.

For future reference, I see you have tried to set the `HF_TASK` environment variable, but you are doing it on the instance used to call boto3, which is _separate_ from the instance that hosts your model and performs inference. Follow [this guide]() for the specifics on **non-custom** inference with SageMaker and Hugging Face.
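
For illustration, setting `HF_TASK` on the hosting side with the SageMaker Python SDK looks roughly like the sketch below. The S3 URI, IAM role, instance type and framework versions are placeholders, not taken from your setup:

```python
from sagemaker.huggingface import HuggingFaceModel

# Placeholders: replace model_data, role, versions and instance type with your own values
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/dolly2/model.tar.gz",          # hypothetical S3 URI
    role="arn:aws:iam::123456789012:role/MySageMakerRole",    # hypothetical role ARN
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_TASK": "feature-extraction"},  # read by the *hosting* container, not your client
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # pick an instance large enough for the model
)

# The payload format stays the same as in your client code
print(predictor.predict({"inputs": "This is a large document."}))
```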

Because you'd like the model to perform two tasks (embeddings and QA), you're correct in identifying that you'll need a custom `inference.py` file. To take this approach, you'll need to perform the following steps:

* Clone the model from Hugging Face using git
* Create a `code/` directory (within the model directory) and add an `inference.py` file
* Include two functions in the inference file; these **must** be called `model_fn()` and `predict_fn()`. The former is used only when the endpoint is initialised and must return the model and tokenizer; the latter is called for each inference request. You can use `predict_fn()` to include custom logic.
* Create a tarball (`model.tar.gz`) with all the model artifacts (including your custom inference code). It should be formatted as below.

```plaintext
model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt
```

* Finally, upload the tarball to S3 and pass the S3 URI to SageMaker when creating the model/endpoint (a short sketch of this step follows below).
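
As a rough sketch of the packaging and upload step with the SageMaker Python SDK, something like the following should work; the local directory name, S3 prefix, role and framework versions are placeholders you would swap for your own:

```python
import tarfile

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Package the model directory (which contains code/inference.py) into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("dolly2-model/", arcname=".")  # hypothetical local model directory

# Upload the tarball to S3
sess = sagemaker.Session()
model_uri = sess.upload_data("model.tar.gz", bucket=sess.default_bucket(), key_prefix="dolly2")

# Create the model from the tarball; the inference.py under code/ is picked up automatically
huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    role=sagemaker.get_execution_role(),  # assumes a SageMaker notebook; otherwise pass a role ARN
    transformers_version="4.26",          # placeholder versions, match them to your container
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```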

There's a great notebook from Hugging Face covering this whole process. It's the best guide I've been able to find so far. If you copy it word for word and only modify the `inference.py` script, you should be successful.

Here's an example of an `inference.py` I've used previously; as you can see, Hugging Face pipelines work too!

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
from DirectQuoteUtils import reformat
import torch
import os

def model_fn(model_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    pipe = pipeline("ner", model=model, tokenizer=tokenizer)
    return pipe

def predict_fn(data, pipeline):
    pipe = pipeline
    outputs = []

    # FORMAT FOR MODEL INPUT:
    # {               # list of strings
    #     "inputs": ["Donald Trump is the president of the US", "Joe Biden is the United States president"]
    # }

    modelData = pipe(data['inputs'])

    for prediction in modelData:
        cleanPred = reformat(prediction)
        outputs.append(cleanPred)

    return {
        # "device": device, # handy to check if CUDA is being used
        "outputs": outputs
    }
```
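
To tie this back to your use case, here is a minimal, untested sketch of how your embedding logic could be reshaped into the `model_fn()`/`predict_fn()` pattern. It assumes the dolly2 checkpoint can be loaded with the generic `AutoTokenizer`/`AutoModel` classes so that hidden states are exposed; adapt it to your actual model and payload format:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def model_fn(model_dir):
    # Load the tokenizer and the base model (no LM head) so hidden states are available
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # dolly-style tokenizers may lack a pad token
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    texts = data["inputs"]  # expects {"inputs": "..."} or {"inputs": ["...", "..."]}
    if isinstance(texts, str):
        texts = [texts]

    encoded = tokenizer(texts, truncation=True, padding="longest", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoded)

    # Mean-pool the last hidden state into one vector per input, as in your original code
    embeddings = outputs.last_hidden_state.mean(dim=1).tolist()
    return {"embeddings": embeddings}
```

With a `predict_fn()` like this packaged under `code/`, the client script should no longer need to set `HF_TASK` at all, since the inference toolkit calls your handler instead of building a default pipeline from the task name.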
