AWS SageMaker model creation failing with internal error

Question

I am trying to create a SageMaker model in multi-model mode. For this I use Terraform, and the plan shows the following resource to be created:

module.main.module.toing.aws_sagemaker_model.sagemaker_multimodel will be created
  + resource "aws_sagemaker_model" "sagemaker_multimodel" {
      + arn                = (known after apply)
      + execution_role_arn = "arn:aws:iam::XXXXX:role/us-west-1-toing_role"
      + id                 = (known after apply)
      + name               = "dev-us-west-test-model"
      + primary_container {
          + image          = "us-west-1-toing_image"
          + mode           = "MultiModel"
          + model_data_url = "s3://toing_bucket/"
        }
    }

The resource creation of this model fails the first time with the error:

Error: creating SageMaker model: ValidationException: The execution role ARN "arn:aws:iam::XXXXX:role/us-west-1-toing_role" is invalid. Please ensure that the role exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com".
│       status code: 400, request id: XXX-XXX-XXX-XXX-XXX

However, when I check the toing_role role in IAM, it has the necessary permissions and trust relationship.

Trust Relationships

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
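As a quick sanity check on a trust relationship like the one above, the following is a minimal Python sketch that tests whether a policy document allows "sagemaker.amazonaws.com" to call "sts:AssumeRole". The function name allows_sagemaker_assume_role is hypothetical, and the check is deliberately simplified: it ignores wildcards and Condition blocks, so it is not a substitute for IAM's own policy evaluation.

```python
import json


def allows_sagemaker_assume_role(trust_policy_json: str) -> bool:
    """Return True if an Allow statement lets sagemaker.amazonaws.com call sts:AssumeRole.

    Simplified check (an assumption of this sketch): wildcards and
    Condition blocks are not evaluated.
    """
    policy = json.loads(trust_policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    def as_list(value):
        # IAM accepts either a string or a list of strings for these fields.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = as_list(stmt.get("Action"))
        if "sagemaker.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False
```

Passing the JSON document shown above as a string should return True. If it does, the trust relationship itself is probably not the cause of the ValidationException.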

Permissions policies

AmazonS3FullAccess
AmazonSageMakerFullAccess

When I re-apply, the creation of the model hangs in Terraform. I checked CloudTrail, and it shows the following error response to the CreateModel request from Terraform. This happens repeatedly.

...
"errorCode": "InternalFailure",
"errorMessage": "An unknown error occurred",
...
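For reference, error fields like these can also be pulled from CloudTrail programmatically. The sketch below assumes boto3 is installed and credentials are configured. The helper names failed_calls and recent_create_model_failures are hypothetical; lookup_events is the boto3 CloudTrail client call, and each returned event carries the raw record as a JSON string under "CloudTrailEvent".

```python
import json


def failed_calls(events):
    """Return (eventTime, errorCode, errorMessage) for events that carry an error."""
    failures = []
    for event in events:
        # Each lookup result embeds the full CloudTrail record as a JSON string.
        record = json.loads(event["CloudTrailEvent"])
        if "errorCode" in record:
            failures.append(
                (record.get("eventTime"), record["errorCode"], record.get("errorMessage"))
            )
    return failures


def recent_create_model_failures(region):
    """Look up recent CreateModel events in one region and keep the failed ones."""
    # Imported lazily so failed_calls() can be used without boto3 installed.
    import boto3

    cloudtrail = boto3.client("cloudtrail", region_name=region)
    response = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "CreateModel"}],
        MaxResults=50,
    )
    return failed_calls(response["Events"])
```

Calling recent_create_model_failures("us-west-1") would list the failed CreateModel requests in that region, which makes it easier to see whether every retry fails with the same error code.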

When I try to create the model manually in the console (GUI) with the same role, I get a ThrottlingException. When I wait for some time and re-run the request, I get an InternalFailure again.

I do not understand whether this is a problem with my execution role, since the same code works perfectly in eu-west-1 but not in us-west-1, and the roles have exactly the same permissions as above.

The S3 buckets are also created in their respective regions. Hence I do not understand where this failure comes from, since the S3 buckets too have exactly the same permissions.

PS1: I have also verified that the container image toing_image exists in the ECR repository of each respective region.

PS2: The role is attached to the model as follows:

# Defining the SageMaker "Assume Role" policy
data "aws_iam_policy_document" "sm_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "sagemaker_inferencer_iam_role" {
  name               = "${var.app_environment}-inferencer-sm-${var.aws_region}-iam-role-${var.endpoint_postfix}"
  assume_role_policy = data.aws_iam_policy_document.sm_assume_role_policy.json
}

# Attaching the AWS default policy, "AmazonSageMakerFullAccess"
data "aws_iam_policy" "sm_required_policy" {
  name = "AmazonSageMakerFullAccess"
}

resource "aws_iam_role_policy_attachment" "sm_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.sm_required_policy.arn
}

# Attaching the AWS default policy, "AmazonS3FullAccess"
data "aws_iam_policy" "s3_required_policy" {
  name = "AmazonS3FullAccess"
}

resource "aws_iam_role_policy_attachment" "s3_full_access_attach" {
  role       = aws_iam_role.sagemaker_inferencer_iam_role.name
  policy_arn = data.aws_iam_policy.s3_required_policy.arn
}

resource "aws_sagemaker_model" "sagemaker_multimodel" {
  name               = "${var.app_environment}-${var.endpoint_postfix}-model"
  execution_role_arn = aws_iam_role.sagemaker_inferencer_iam_role.arn

  primary_container {
    image          = local.multi_model_inferencer_container_name
    mode           = "MultiModel"
    model_data_url = "s3://${local.model_bucket_name}/"
  }

  tags = var.default_tags
}

What am I doing wrong?

Answer 1

Score: 0

This was caused by STS (AWS Security Token Service) not being enabled for the specific regions in which I was trying to run my infrastructure. Apparently, in some regions STS is not enabled by default and has to be enabled through IAM -> Account Settings -> STS -> Endpoints.
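One way to probe this from a script is to call the regional STS endpoint directly. This is a minimal sketch, assuming boto3 is installed and credentials are configured. The helper names are hypothetical, and it is an assumption of this sketch that a deactivated regional endpoint rejects GetCallerIdentity with an error code; the IAM account settings page remains the authoritative place to check.

```python
def regional_sts_endpoint(region: str) -> str:
    """Return the regional STS endpoint URL for a standard-partition region."""
    return f"https://sts.{region}.amazonaws.com"


def probe_regional_sts(region: str) -> str:
    """Call GetCallerIdentity against the regional endpoint.

    Returns the caller ARN on success, or the AWS error code on failure.
    """
    # Imported lazily so the URL helper works without boto3 installed.
    import boto3
    from botocore.exceptions import ClientError

    sts = boto3.client(
        "sts",
        region_name=region,
        endpoint_url=regional_sts_endpoint(region),
    )
    try:
        return sts.get_caller_identity()["Arn"]
    except ClientError as err:
        return err.response["Error"]["Code"]
```

Comparing probe_regional_sts("eu-west-1") with probe_regional_sts("us-west-1") may help show whether the two regions behave differently for the same credentials.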

huangapple
  • Posted on June 8, 2023, 19:08:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76431211.html