How to pass EMR Serverless PySpark entryPointArguments as a variable

Question

I have an EMR Serverless PySpark job that I launch from a Step Functions state machine. I am trying to pass arguments to SparkSubmit through entryPointArguments as variables that are set at the beginning of the step function (e.g. today_date, source, tuned_parameters) and then used in the PySpark code.

I found a partial solution in this post; however, I am trying to pass variables from the step function rather than a hardcoded argument such as "prd".

        "JobDriver": {
          "SparkSubmit": {
            "EntryPoint": "s3://xxxx-my-code/test/my_code_edited_3.py",
            "EntryPointArguments": ["-env", "prd", "-source.$", "$.source"]
          }
        }

Using argparse I can read the first argument, "-env", and it successfully returns "prd". However, I cannot figure out how to pass a variable for the source argument.

Answer 1

Score: 2

I managed to find an answer to this question. Variable arguments are passed to EMR Serverless SparkSubmit using Amazon States Language intrinsic functions.

Suppose the JSON input to the step function is:

    {
      "source": "mysource123"
    }

The correct way to pass this variable in EntryPointArguments is to add the ".$" suffix to the key and build the whole array with States.Array:

    "EntryPointArguments.$": "States.Array('-source', $.source)"

The PySpark job running in EMR Serverless can then read this variable with argparse:

    import argparse

    # Declare the flag passed via EntryPointArguments, then parse sys.argv.
    parser = argparse.ArgumentParser()
    parser.add_argument("-source")
    args = parser.parse_args()
    print(args.source)

The print statement outputs mysource123.
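
The same pattern should extend to several arguments, mixing hardcoded values with variables from the input. The sketch below is only an illustration under assumptions: states_array and parse_job_args are hypothetical helper names, the today_date input field and its value are made up for the example, and the Step Functions resolution step is simulated by hand rather than run against AWS.

```python
import argparse


def states_array(*tokens):
    """Render a States.Array(...) expression string (hypothetical helper).

    Tokens starting with "$" are treated as JSON paths and left bare;
    all other tokens become single-quoted string literals (no escaping).
    """
    rendered = [t if t.startswith("$") else f"'{t}'" for t in tokens]
    return "States.Array(" + ", ".join(rendered) + ")"


# Value to place under the "EntryPointArguments.$" key of the state definition.
expression = states_array(
    "-env", "prd",
    "-source", "$.source",
    "-today_date", "$.today_date",  # assumed extra field in the input
)
print(expression)


def parse_job_args(argv=None):
    """Parse the flags inside the PySpark job; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-env")
    parser.add_argument("-source")
    parser.add_argument("-today_date")
    return parser.parse_args(argv)


# Hand-written stand-in for the argument list the job would receive once
# the JSON paths have been resolved against the example input.
args = parse_job_args(
    ["-env", "prd", "-source", "mysource123", "-today_date", "2023-02-26"]
)
print(args.env, args.source, args.today_date)
```

If a value such as tuned_parameters is not a plain string in the input, it may need converting first (for example with the States.JsonToString intrinsic function), since the job receives its arguments as strings; that case is not covered by this sketch.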

huangapple
  • Published on 2023-02-26 21:02:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75572165.html