AWS SageMaker model creation failing with internal error

Question


I am trying to create a SageMaker model in multi-model mode. For this, I use Terraform, and the plan shows the following resource to be created:

    module.main.module.toing.aws_sagemaker_model.sagemaker_multimodel will be created
    + resource "aws_sagemaker_model" "sagemaker_multimodel" {
        + arn                = (known after apply)
        + execution_role_arn = "arn:aws:iam::XXXXX:role/us-west-1-toing_role"
        + id                 = (known after apply)
        + name               = "dev-us-west-test-model"
        + primary_container {
            + image          = "us-west-1-toing_image"
            + mode           = "MultiModel"
            + model_data_url = "s3://toing_bucket/"
          }
      }

The resource creation of this model fails the first time with the error:

    Error: creating SageMaker model: ValidationException: The execution role ARN "arn:aws:iam::XXXXX:role/us-west-1-toing_role" is invalid. Please ensure that the role exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com".
        status code: 400, request id: XXX-XXX-XXX-XXX-XXX

However, when I check the toing_role role in IAM, it has the necessary permissions and trust relationships.

Trust Relationships

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "sagemaker.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
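
As a quick offline sanity check of what the ValidationException asks for, the trust policy above can be tested with a few lines of Python. This is a minimal sketch and not part of the original question: the function name `allows_assume_role` is hypothetical, and it only handles the simple `Effect` / `Principal.Service` / `Action` shape shown here (conditions, wildcards and `NotPrincipal` are ignored).

```python
import json


def _as_list(value):
    """IAM accepts a single value or a list of values; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allows_assume_role(policy_json, service="sagemaker.amazonaws.com"):
    """Return True if some Allow statement lets `service` call sts:AssumeRole.

    Simplified check (an assumption of this sketch): Condition blocks,
    wildcards and NotPrincipal/NotAction are not evaluated.
    """
    policy = json.loads(policy_json)
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = _as_list(statement.get("Action"))
        if service in services and "sts:AssumeRole" in actions:
            return True
    return False


# The trust relationship from the question, copied as-is.
TRUST_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
"""

if __name__ == "__main__":
    print(allows_assume_role(TRUST_POLICY))
```

If a check like this passes, the text of the trust policy is probably not what the error is really about, and the cause is more likely elsewhere in the account or region setup.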

Permissions policies

    AmazonS3FullAccess
    AmazonSageMakerFullAccess

When I re-apply, the model creation hangs in Terraform. I checked CloudTrail, and the CreateModel request from Terraform gets the following error response. This happens repeatedly.

    ...
    "errorCode": "InternalFailure",
    "errorMessage": "An unknown error occurred",
    ...
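
For anyone repeating this check, failed calls can be pulled out of exported CloudTrail records with a short script. This is a rough sketch under stated assumptions: the helper name `failed_calls` is hypothetical, the records are assumed to be already loaded as Python dicts (for example from the `Records` array of a CloudTrail log file), and the sample timestamps are made up for illustration. The field names `eventSource`, `eventName`, `errorCode` and `errorMessage` are standard CloudTrail record fields.

```python
def failed_calls(records, event_name="CreateModel",
                 event_source="sagemaker.amazonaws.com"):
    """Return (eventTime, errorCode, errorMessage) for failed calls of one API."""
    failures = []
    for record in records:
        if record.get("eventSource") != event_source:
            continue
        if record.get("eventName") != event_name:
            continue
        if "errorCode" not in record:
            continue  # successful calls carry no errorCode field
        failures.append((
            record.get("eventTime"),
            record["errorCode"],
            record.get("errorMessage", ""),
        ))
    return failures


# Illustrative records only: the timestamps are invented, the error codes
# are the ones reported in the question.
SAMPLE_RECORDS = [
    {
        "eventTime": "2023-06-08T10:00:00Z",
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "CreateModel",
        "errorCode": "InternalFailure",
        "errorMessage": "An unknown error occurred",
    },
    {
        "eventTime": "2023-06-08T10:05:00Z",
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "CreateModel",
        "errorCode": "ThrottlingException",
    },
    {
        "eventTime": "2023-06-08T10:06:00Z",
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "ListModels",
    },
]

if __name__ == "__main__":
    for event_time, code, message in failed_calls(SAMPLE_RECORDS):
        print(event_time, code, message)
```

Listing the failures in time order makes it easier to see whether the same error keeps recurring or alternates with throttling.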

When I try to create the model manually in the console with the same role, I get a ThrottlingException. When I wait for some time and re-run the request, I get an InternalFailure again.

I do not understand whether this is a problem with my execution role, since the same code works perfectly in eu-west-1 but not in us-west-1, and the roles have exactly the same permissions, as shown above.

The S3 buckets are also created in their respective regions and have exactly the same permissions, so I do not understand where this failure comes from.

PS1: I have also verified that the container image toing_image exists in the ECR repository of each region.

PS2: The role is attached to the model as follows:

    # Defining the SageMaker "Assume Role" policy
    data "aws_iam_policy_document" "sm_assume_role_policy" {
      statement {
        actions = ["sts:AssumeRole"]

        principals {
          type        = "Service"
          identifiers = ["sagemaker.amazonaws.com"]
        }
      }
    }

    resource "aws_iam_role" "sagemaker_inferencer_iam_role" {
      name               = "${var.app_environment}-inferencer-sm-${var.aws_region}-iam-role-${var.endpoint_postfix}"
      assume_role_policy = data.aws_iam_policy_document.sm_assume_role_policy.json
    }

    # Attaching the AWS default policy, "AmazonSageMakerFullAccess"
    data "aws_iam_policy" "sm_required_policy" {
      name = "AmazonSageMakerFullAccess"
    }

    resource "aws_iam_role_policy_attachment" "sm_full_access_attach" {
      role       = aws_iam_role.sagemaker_inferencer_iam_role.name
      policy_arn = data.aws_iam_policy.sm_required_policy.arn
    }

    # Attaching the AWS default policy, "AmazonS3FullAccess"
    data "aws_iam_policy" "s3_required_policy" {
      name = "AmazonS3FullAccess"
    }

    resource "aws_iam_role_policy_attachment" "s3_full_access_attach" {
      role       = aws_iam_role.sagemaker_inferencer_iam_role.name
      policy_arn = data.aws_iam_policy.s3_required_policy.arn
    }

    resource "aws_sagemaker_model" "sagemaker_multimodel" {
      name               = "${var.app_environment}-${var.endpoint_postfix}-model"
      execution_role_arn = aws_iam_role.sagemaker_inferencer_iam_role.arn

      primary_container {
        image          = local.multi_model_inferencer_container_name
        mode           = "MultiModel"
        model_data_url = "s3://${local.model_bucket_name}/"
      }

      tags = var.default_tags
    }

What am I doing wrong?

Answer 1

Score: 0

This was caused by STS (AWS Security Token Service) not being enabled in the specific region where I was trying to run my infrastructure. Apparently, in some regions STS is not enabled by default and has to be enabled through IAM -> Account Settings -> STS -> Endpoints, as explained here.

huangapple
  • Posted on June 8, 2023, 19:08:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76431211.html