June 8, 2023, 19:08:48

AWS SageMaker model creation failing with internal error

Question

I am trying to create a SageMaker model in multi-model mode. For this, I use Terraform, and the plan shows the following resource to be created:

module.main.module.toing.aws_sagemaker_model.sagemaker_multimodel will be created
  + resource "aws_sagemaker_model" "sagemaker_multimodel" {
      + arn                = (known after apply)
      + execution_role_arn = "arn:aws:iam::XXXXX:role/us-west-1-toing_role"
      + id                 = (known after apply)
      + name               = "dev-us-west-test-model"
      + primary_container {
          + image          = "us-west-1-toing_image"
          + mode           = "MultiModel"
          + model_data_url = "s3://toing_bucket/"
        }
    }

The resource creation of this model fails the first time with the error:

Error: creating SageMaker model: ValidationException: The execution role ARN "arn:aws:iam::XXXXX:role/us-west-1-toing_role" is invalid. Please ensure that the role exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com".
│       status code: 400, request id: XXX-XXX-XXX-XXX-XXX

However, when I check the toing_role role in IAM, it has the necessary permissions and trust relationship.

Trust Relationships

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Permissions policies

AmazonS3FullAccess
AmazonSageMakerFullAccess

When I re-apply, the creation of the model hangs in Terraform. I checked CloudTrail, and it shows the following error response to the CreateModel request from Terraform. This happens repeatedly.

...
"errorCode": "InternalFailure",
"errorMessage": "An unknown error occurred",
...

When I try to create the model manually using the GUI with the same role, I get a ThrottlingException. When I wait for some time and re-run the request, I get an InternalFailure again.

I do not understand whether this is a problem with my execution role, since the same code works perfectly in eu-west-1 but not in us-west-1, and the roles have exactly the same permissions as shown above.

The S3 buckets are also created in their respective regions. Hence I do not understand where this failure comes from, since the S3 buckets also have exactly the same permissions.

PS1: I have also verified that the container image toing_image exists in the ECR repository of each respective region.

PS2: The role is attached to the model as follows:

# Defining the SageMaker "Assume Role" policy
data "aws_iam_policy_document" "sm_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
resource "aws_iam_role" "sagemaker_inferencer_iam_role" {
  name               = "${var.app_environment}-inferencer-sm-${var.aws_region}-iam-role-${var.endpoint_postfix}"
  assume_role_policy = data.aws_iam_policy_document.sm_assume_role_policy.json
}
# Attaching the AWS default policy, "AmazonSageMakerFullAccess"
data "aws_iam_policy" "sm_required_policy" {
  name = "AmazonSageMakerFullAccess"
}
resource "aws_iam_role_policy_attachment" "sm_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.sm_required_policy.arn
}
# Attaching the AWS default policy, "AmazonS3FullAccess"
data "aws_iam_policy" "s3_required_policy" {
  name = "AmazonS3FullAccess"
}
resource "aws_iam_role_policy_attachment" "s3_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.s3_required_policy.arn
}
resource "aws_sagemaker_model" "sagemaker_multimodel" {
  name               = "${var.app_environment}-${var.endpoint_postfix}-model"
  execution_role_arn = aws_iam_role.sagemaker_inferencer_iam_role.arn
  primary_container {
    image          = local.multi_model_inferencer_container_name
    mode           = "MultiModel"
    model_data_url = "s3://${local.model_bucket_name}/"
  }
  tags = var.default_tags
}

What am I doing wrong?

Answer 1

Score: 0

This was caused by STS (AWS Security Token Service) not being enabled for the specific region in which I was trying to run my infrastructure. Apparently, in some regions STS is not enabled by default and has to be enabled through IAM -> Account Settings -> STS -> Endpoints.
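
For anyone who wants to check this from a script before touching the console, here is a minimal sketch in Python with boto3 that calls GetCallerIdentity directly against the regional STS endpoint. The function names `sts_regional_endpoint` and `probe_regional_sts` are made up for illustration. It also assumes that a deactivated regional endpoint rejects this call for your own credentials, which I have not verified; the IAM account settings page remains the authoritative place to look.

```python
def sts_regional_endpoint(region: str) -> str:
    """Return the regional STS endpoint URL for a region in the standard aws partition."""
    return f"https://sts.{region}.amazonaws.com"


def probe_regional_sts(region: str) -> str:
    """Call GetCallerIdentity against the regional STS endpoint and return the caller ARN.

    Any failure (botocore ClientError, missing credentials, connection problems)
    propagates to the caller as an exception.
    """
    # Imported lazily so the URL helper above can be used without boto3 installed.
    import boto3

    client = boto3.client(
        "sts",
        region_name=region,
        # Pin the call to the regional endpoint instead of the global one.
        endpoint_url=sts_regional_endpoint(region),
    )
    return client.get_caller_identity()["Arn"]
```

Calling `probe_regional_sts("eu-west-1")` and `probe_regional_sts("us-west-1")` with the same credentials and comparing the results may show whether only one of the two regions from the question is affected.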
