Vertex AI Pipelines. Batch Prediction 'Error state: 5.'
Question
I have been trying to run a Vertex AI pipeline using Kubeflow Pipelines and the google-cloud-pipeline-components library. The pipeline is entirely custom container components with the exception of the batch predictions.
The code for my pipeline is of the following form:
# GCP infrastructure resources
from google.cloud import aiplatform, storage
from google_cloud_pipeline_components import aiplatform as gcc_aip

# kubeflow resources
import kfp
from kfp.v2 import dsl, compiler
from kfp.v2.dsl import component, pipeline

train_container_uri = '<insert custom docker image in gcr for training code>'
# placeholders, defined here so the snippet is self-contained
bucket_name = 'gs://<insert staging bucket name>'
pipeline_root_path = 'gs://<insert bucket name>/<insert pipeline root directory>'

@pipeline(name="<pipeline name>", pipeline_root=pipeline_root_path)
def my_ml_pipeline():
    # run the preprocessing workflow using a custom kfp component and get the outputs
    preprocess_op = preprocess_component()
    train_path, test_path = preprocess_op.outputs['Train Data'], preprocess_op.outputs['Test Data']

    # path to string for gcs uri containing train data
    train_path_text = preprocess_op.outputs['Train Data GCS Path']

    # create training dataset on Vertex AI from the preprocessing outputs
    train_set_op = gcc_aip.TabularDatasetCreateOp(
        project='<insert gcp project id>',
        display_name='<insert display name>',
        location='us-west1',
        gcs_source=train_path_text
    )
    train_set = train_set_op.outputs['dataset']

    # custom training op
    training_op = gcc_aip.CustomContainerTrainingJobRunOp(
        project='<insert gcp project id>',
        display_name='<insert display name>',
        location='us-west1',
        dataset=train_set,
        container_uri=train_container_uri,
        staging_bucket=bucket_name,
        model_serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-11:latest',
        model_display_name='<insert model name>',
        machine_type='n1-standard-4')
    model_output = training_op.outputs['model']

    # batch prediction op
    batch_prediction_op = gcc_aip.ModelBatchPredictOp(
        project='<insert gcp project id>',
        job_display_name='<insert name of job>',
        location='us-west1',
        model=model_output,
        gcs_source_uris=['gs://<bucket name>/<directory>/name_of_file.csv'],
        instances_format='csv',
        gcs_destination_output_uri_prefix='gs://<bucket name>/<directory>/',
        machine_type='n1-standard-4',
        accelerator_count=2,
        accelerator_type='NVIDIA_TESLA_P100')
(For security and non-disclosure reasons, I can't input any specific paths or gcp projects, just trust that I inputted those correctly)
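For reference, here's a minimal sketch of how a pipeline like this is typically compiled and submitted with the KFP v2 compiler and the Vertex AI SDK; this is an assumed harness with placeholder names, not the exact script I used:

# Compile the pipeline definition to JSON and submit it to Vertex AI.
from google.cloud import aiplatform
from kfp.v2 import compiler

compiler.Compiler().compile(
    pipeline_func=my_ml_pipeline,
    package_path='my_ml_pipeline.json')

aiplatform.init(project='<insert gcp project id>', location='us-west1')
job = aiplatform.PipelineJob(
    display_name='<insert display name>',
    template_path='my_ml_pipeline.json',
    pipeline_root=pipeline_root_path)
job.run()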
My initial preprocessing and training components seem to work fine (model is uploaded to registry, training job succeeded, preprocessed data appears in GCS buckets as seemingly necessary). However, my pipeline fails to finish when it gets to the batch prediction phase.
The error log terminates with the following error:
ValueError: Job failed with value error in error state: 5.
Additionally, I have a picture of the logs (the traceback only contains references to the google-cloud-pipeline-components library, none of my specific code). [Screenshot of the pipeline error logs omitted.] This error is presumably within the scope of the ModelBatchPredictOp() method.
I don't even know where to begin, but could anyone give any pointers as to what error state 5 means? I know it's a ValueError, so it must have received an invalid value either in the method or in the model. However, I've run the model on the exact same dataset locally, so I assume that it is an invalid input into the method. That said, I have checked every input into ModelBatchPredictOp(). Has anyone gotten this error state before? Any help is appreciated.
Using google-cloud-pipeline-components==1.0.42, google-cloud-aiplatform==1.24.1, kfp==1.8.18. My model is trained on TensorFlow 2.11.1 with Python 3.10, in both my custom docker images and the script used to run the pipeline. Thank you in advance!
Edit 1 (2023-05-10):
I've looked it up on the GitHub repo; it seems that my ValueError has the following description:
// Some requested entity (e.g., file or directory) was not found.
//
// Note to server developers: if a request is denied for an entire class
// of users, such as gradual feature rollout or undocumented allowlist,
// `NOT_FOUND` may be used. If a request is denied for some users within
// a class of users, such as user-based access control, `PERMISSION_DENIED`
// must be used.
//
// HTTP Mapping: 404 Not Found
NOT_FOUND = 5;
(The error code is detailed here https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto)
(The exception leading to my error message was surfaced here https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/container/v1/gcp_launcher/job_remote_runner.py)
Now the question is, where in my ModelBatchPredictOp() is a file or directory missing? I've checked to make sure that all of the GCS paths I've inputted are correct and lead to the expected locations. Any further thoughts?
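One way to rule out the input path independently of the pipeline is a quick existence check with the storage client (same placeholders as the pipeline code):

# Sanity-check that the batch prediction input object actually exists in GCS.
from google.cloud import storage

client = storage.Client(project='<insert gcp project id>')
blob = client.bucket('<bucket name>').blob('<directory>/name_of_file.csv')
print('batch input exists:', blob.exists())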
Edit 2 (2023-05-10):
I noticed that the output of the ModelBatchPredictOp() component throws me a JSON detailing some errors. This is the error body:
{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "status": "UNAUTHENTICATED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "CREDENTIALS_MISSING",
        "domain": "googleapis.com",
        "metadata": {
          "service": "aiplatform.googleapis.com",
          "method": "google.cloud.aiplatform.v1.JobService.GetBatchPredictionJob"
        }
      }
    ]
  }
}
However, I have provided the necessary IAM roles to every relevant service agent/account (according to this: https://cloud.google.com/vertex-ai/docs/general/access-control), so I am still trying to figure out where my pipeline is missing credentials. This is at least consistent with the original error code: since Error state: 5. means a directory/file can't be found, it makes sense that missing credentials would surface as this error state. I am just not aware of which IAM roles/permissions my service accounts/agents are missing. Another update to be expected soon.
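One avenue worth trying (a sketch, not a confirmed fix): unless a service account is passed explicitly, Vertex AI Pipelines runs as the Compute Engine default service account, so pinning the job to an account known to hold the Vertex AI and Storage roles can help isolate the problem. The account name below is a placeholder:

# Run the pipeline as an explicit service account instead of the default
# Compute Engine service account; the account must have (at least)
# roles/aiplatform.user plus read/write access to the GCS buckets involved.
from google.cloud import aiplatform

job = aiplatform.PipelineJob(
    display_name='<insert display name>',
    template_path='my_ml_pipeline.json',
    pipeline_root=pipeline_root_path)
job.run(service_account='<pipeline-sa>@<insert gcp project id>.iam.gserviceaccount.com')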
Edit 3 (2023-05-10):
I don't have a solution yet, but I do have another clue of what's going on. All of my batch predictions have been timing out after 18-20 minutes (huge variance but oh well). I think that the model is actually performing the batch predictions properly, but it is not able to write the predictions to the destination bucket. This makes sense because the batch prediction job fails after 20 minutes every single run. I think that whatever code writes the predictions to the bucket has insufficient permissions to perform this write.
The only issue now is that I still don't know where I am supposed to provision the appropriate credentials to perform this write after the batch prediction completes.
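To test that hypothesis in isolation, a quick write to the destination prefix using the same credentials the job runs under would confirm or rule out the permission issue; a sketch (the object name is made up):

# Smoke test: try writing to the batch prediction destination prefix.
# Run this authenticated as the service account the pipeline uses.
from google.cloud import storage

client = storage.Client(project='<insert gcp project id>')
bucket = client.bucket('<bucket name>')
bucket.blob('<directory>/_write_test.txt').upload_from_string('ok')
print('write to destination prefix succeeded')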
Answer 1 (Score: 2)
I was originally going to edit my post, but for the sake of marking the trail for future Vertex AI users, I will put this as an answer.
My initial problem here was that I did not use the correct serialization format for my model. I am using a custom-trained TensorFlow model (specifically a tf.keras model), and I mistakenly used the .h5 format instead of the TensorFlow SavedModel format. Why this threw a 404 Not Found error is beyond my understanding; the error message surrounding the incorrect serialization format needs to be MUCH clearer.
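For anyone who hits the same issue, saving in the format the pre-built containers expect is straightforward; a minimal sketch:

import tensorflow as tf

# Toy model purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])

# A path with no .h5 extension makes TF 2.x write the SavedModel format:
# a directory containing saved_model.pb plus a variables/ subdirectory.
# This is what the pre-built Vertex AI prediction containers expect.
model.save('model_dir')

# model.save('model.h5')  # HDF5 format - rejected by the pre-built containers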
The reason why I figured this out was that I tried to deploy this model to an endpoint for online predictions. This errored out with the following message:
Vertex AI <noreply-vertexai@google.com>
Mon, May 15, 3:49 PM (2 days ago)
to me

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to create endpoint "test_endpoint_appmkt1".

Additional Details:
Operation State: Failed with errors
Resource Name: projects/971627237637/locations/us-west1/endpoints/910580345751994368
Error Messages: Model artifact directory is expected to contain exactly one of: [saved_model.pb, saved_model.pbtxt]. Please re-upload the Model with correct artifact directory.
This made me realize that the pre-built TensorFlow prediction containers likely only recognize the TensorFlow SavedModel format.
Furthermore, if you are using a tensorflow.keras model, make sure that your batch prediction inputs are formatted correctly.
The request body for an online prediction on a tf.keras model should look as follows:
{"instances": [
{"<name of input layer1>": <input values of the appropriate shape>,
"<name of input layer2>": <input values of the appropriate shape>, ...},
<repeat the format of the previous element in this list for your next record, etc...>
]}
Each dictionary in the value associated with "instances" is a single record. The "<name of input layer>" is taken from the output of model.summary().
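As a toy illustration (assuming a hypothetical model whose model.summary() shows a single input layer named dense_input taking three features), an online prediction through the SDK would look like:

from google.cloud import aiplatform

# Hypothetical endpoint ID and input layer name, purely for illustration.
endpoint = aiplatform.Endpoint(
    'projects/<insert gcp project id>/locations/us-west1/endpoints/<endpoint id>')
response = endpoint.predict(instances=[
    {'dense_input': [0.1, 0.2, 0.3]},   # record 1
    {'dense_input': [0.4, 0.5, 0.6]},   # record 2
])
print(response.predictions)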
For batch predictions, you need to input a .jsonl file. This file will be processed by the batch prediction service into a request body like the online-prediction request body above. However, the jsonl file needs to contain each dictionary in the value associated with "instances" in the following way:
# COMMENT: for record 1 (each record must occupy exactly one line of the .jsonl file)
{"<name of input layer1>": <input values of the appropriate shape>, "<name of input layer2>": <input values of the appropriate shape>, ...}
# COMMENT: for record 2
{"<name of input layer1>": <input values of the appropriate shape>, "<name of input layer2>": <input values of the appropriate shape>, ...}
# COMMENT: for record 3
{"<name of input layer1>": <input values of the appropriate shape>, "<name of input layer2>": <input values of the appropriate shape>, ...}
# etc...
Note: this is if you use the pre-built TensorFlow prediction container from Google. The input format is consistent with the input format required by TensorFlow Serving.
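A small sketch of producing such a file (again using the hypothetical dense_input layer name):

import json

# Each record becomes exactly one line of the .jsonl batch input file.
records = [
    {'dense_input': [0.1, 0.2, 0.3]},
    {'dense_input': [0.4, 0.5, 0.6]},
]
with open('batch_input.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')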
If anyone in the future has similar issues with batch predictions in Vertex AI, please comment here and I will do my best to share what possible workarounds could be. Google's documentation surrounding Vertex AI is absolutely horrendous, and GCP customer support tries their best but is typically not able to help in a timely manner.