AWS SageMaker model creation failing with internal error
Question
I am trying to create a SageMaker model in multi-model mode. For this, I use Terraform, and the plan shows the following resource to be created:
module.main.module.toing.aws_sagemaker_model.sagemaker_multimodel will be created
  + resource "aws_sagemaker_model" "sagemaker_multimodel" {
      + arn                = (known after apply)
      + execution_role_arn = "arn:aws:iam::XXXXX:role/us-west-1-toing_role"
      + id                 = (known after apply)
      + name               = "dev-us-west-test-model"
      + primary_container {
          + image          = "us-west-1-toing_image"
          + mode           = "MultiModel"
          + model_data_url = "s3://toing_bucket/"
        }
    }
The resource creation of this model fails the first time with the error:
Error: creating SageMaker model: ValidationException: The execution role ARN "arn:aws:iam::XXXXX:role/us-west-1-toing_role" is invalid. Please ensure that the role exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com".
│ status code: 400, request id: XXX-XXX-XXX-XXX-XXX
However, when I check the toing_role role in IAM, it has the necessary permissions and trust relationships.
Trust Relationships
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Permissions policies
AmazonS3FullAccess
AmazonSageMakerFullAccess
When I re-apply, the creation of the model hangs in Terraform. I checked CloudTrail, and it shows the following error response to the CreateModel request from Terraform. This happens repeatedly.
...
"errorCode": "InternalFailure",
"errorMessage": "An unknown error occurred",
...
When I try to create the model manually using the GUI with the same role, I get a ThrottlingException. When I wait for some time and re-run the request, I get an InternalFailure again.
I do not understand whether this is a problem with my execution role, since the same code works perfectly in eu-west-1 but not in us-west-1, and the roles have exactly the same permissions, as shown above.
The S3 buckets are also created in their respective regions and have exactly the same permissions, so I do not understand where this failure comes from.
PS1: I have also verified that the container toing_image exists separately in the ECR repositories of their respective regions.
PS2: The role is attached to the model as follows:
# Defining the SageMaker "Assume Role" policy
data "aws_iam_policy_document" "sm_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
resource "aws_iam_role" "sagemaker_inferencer_iam_role" {
  name               = "${var.app_environment}-inferencer-sm-${var.aws_region}-iam-role-${var.endpoint_postfix}"
  assume_role_policy = data.aws_iam_policy_document.sm_assume_role_policy.json
}
# Attaching the AWS default policy, "AmazonSageMakerFullAccess"
data "aws_iam_policy" "sm_required_policy" {
  name = "AmazonSageMakerFullAccess"
}
resource "aws_iam_role_policy_attachment" "sm_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.sm_required_policy.arn
}
# Attaching the AWS default policy, "AmazonS3FullAccess"
data "aws_iam_policy" "s3_required_policy" {
  name = "AmazonS3FullAccess"
}
resource "aws_iam_role_policy_attachment" "s3_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.s3_required_policy.arn
}
resource "aws_sagemaker_model" "sagemaker_multimodel" {
  name               = "${var.app_environment}-${var.endpoint_postfix}-model"
  execution_role_arn = aws_iam_role.sagemaker_inferencer_iam_role.arn
  primary_container {
    image          = local.multi_model_inferencer_container_name
    mode           = "MultiModel"
    model_data_url = "s3://${local.model_bucket_name}/"
  }
  tags = var.default_tags
}
What am I doing wrong?
Answer 1
Score: 0
This was caused by STS (AWS Security Token Service) not being enabled in the specific region where I was trying to run my infrastructure. Apparently, in some regions STS is not enabled by default and has to be enabled through IAM -> Account Settings -> STS -> Endpoints, as explained here.
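For anyone who wants to check this before touching Terraform, here is a minimal sketch in Python. It assumes boto3 is installed and that credentials are already configured; the helper names `regional_sts_endpoint` and `sts_reachable` are made up for illustration. It calls `GetCallerIdentity` against the regional STS endpoint of a given region. A failure is only a hint, not proof, that the endpoint is deactivated, since bad credentials or network problems would fail the same way.

```python
def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for a region in the standard aws partition."""
    return f"https://sts.{region}.amazonaws.com"


def sts_reachable(region: str) -> bool:
    """Return True if GetCallerIdentity succeeds against the region's own STS endpoint."""
    # Imported lazily so the URL helper above works without boto3 installed.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "sts",
        region_name=region,
        endpoint_url=regional_sts_endpoint(region),
    )
    try:
        identity = client.get_caller_identity()
    except (BotoCoreError, ClientError) as exc:
        # Could be a deactivated endpoint, but also bad credentials or no network.
        print(f"{region}: STS call failed: {exc}")
        return False
    print(f"{region}: OK, caller is {identity['Arn']}")
    return True
```

Running `sts_reachable("eu-west-1")` and `sts_reachable("us-west-1")` side by side should show whether the two regions from the question behave differently for the same credentials.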