TypeError: Object of type Properties is not JSON serializable (SageMaker Pipeline)

Question

I am trying to set up a SageMaker pipeline with two steps: preprocessing, then training a random forest (RF) model.
The first step produces three outputs: scaled_data.csv, train.csv, and test.csv. The second step should take the train and test CSVs and train the RF model. Step 2 fails with "TypeError: Object of type Properties is not JSON serializable".

Here is my code for setting the pipeline steps:

# upload data from local path to default bucket with prefix raw_data
WORK_DIRECTORY = "data"

input_data = sagemaker_session.upload_data(
    path="{}/{}".format(WORK_DIRECTORY, "dataset.csv"),
    bucket=bucket,
    key_prefix="{}/{}".format(prefix, "input_data"),
)
  1. Setting up the first step (scaling step):
scaling_processor = SKLearnProcessor(
    framework_version=FRAMEWORK_VERSION,
    instance_type="ml.m5.4xlarge",
    instance_count=processing_instance_count,
    base_job_name="data-process",
    role=role,
    sagemaker_session=pipeline_session,
)

scaling_processor_args = scaling_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="scaled_data", source="/opt/ml/processing/output/scaled_data/"),
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train/"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test/"),
    ],
    code="scripts/preprocess.py",
)

step_process = ProcessingStep(name="DataProcess", step_args=scaling_processor_args)
  2. Setting up the second step (RF training, BYO mode); this is where the error arises:
estimator_cls = sagemaker.sklearn.SKLearn
FRAMEWORK_VERSION = "0.23-1"

rf_processor = FrameworkProcessor(
    estimator_cls,
    FRAMEWORK_VERSION,
    role=role,
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    base_job_name='rf-modelling'
)

rf_processor_args = rf_processor.run(
    inputs=[
        ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
                        destination="/opt/ml/processing/input"),
        ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
                        destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="rf_model", source="/opt/ml/processing/output/"),
    ],
    code="scripts/train.py",
)

step_train = ProcessingStep(name="RFTrain", step_args=rf_processor_args)

The problem is with the lines that set the ProcessingInput sources for the second step in rf_processor_args.
Any ideas what might cause this error?

Answer 1

Score: 1

Calling run() looks like the wrong choice here: run() starts the processing job directly rather than defining a pipeline step, which is apparently what you want. Use ProcessingStep directly and pass it all the necessary arguments:

# Setup the first step (scaling step)

...

# --> Use ProcessingStep directly and provide all the args
step_process = ProcessingStep(
    name="DataProcess",
    processor=scaling_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="scaled_data", source="/opt/ml/processing/output/scaled_data/"),
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train/"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test/"),
    ],
    code="scripts/preprocess.py",
)

# Setup the 2nd step (RF Training-BYO mode)

...

# --> Use ProcessingStep directly and provide all the args
step_train = ProcessingStep(
    name="RFTrain",
    processor=rf_processor,
    inputs=[
        ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
                        destination="/opt/ml/processing/input/train"),
        ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
                        destination="/opt/ml/processing/input/test"),
    ],
    outputs=[
        ProcessingOutput(output_name="rf_model", source="/opt/ml/processing/output/"),
    ],
    code="scripts/train.py",
)
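
To illustrate why starting a job directly would trip over a step property, here is a minimal sketch in plain Python. It does not use the SageMaker SDK: PlaceholderProperty, to_reference, try_serialize, and the reference format are all hypothetical stand-ins, chosen only to show that an object with no concrete value yet cannot be passed to json.dumps, while a reference written into a definition can.

```python
import json


class PlaceholderProperty:
    """Hypothetical stand-in for a step property (NOT the SageMaker class)."""

    def __init__(self, path):
        self.path = path

    def to_reference(self):
        # Illustrative format: a definition stores a pointer that is
        # resolved later, when the earlier step has actually run.
        return {"Get": self.path}


def try_serialize(request):
    """Return the JSON string, or the error text if serialization fails."""
    try:
        return json.dumps(request)
    except TypeError as err:
        return f"TypeError: {err}"


train_uri = PlaceholderProperty(
    "Steps.DataProcess.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"
)

# Starting a job right away: the request body must be JSON now,
# but the placeholder has no concrete value yet, so this fails.
immediate = try_serialize({"S3Uri": train_uri})

# Defining a step instead: the placeholder is written out as a reference.
deferred = try_serialize({"S3Uri": train_uri.to_reference()})

print(immediate)
print(deferred)
```

If this is what happens in the question, it may also matter that scaling_processor is created with sagemaker_session=pipeline_session while rf_processor is not. As far as I know, run() only returns step arguments instead of starting a job when the processor uses a pipeline session, so that difference is worth checking.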

  • Posted by huangapple on April 14, 2023, 00:06:39
  • Please keep this link when reposting: https://go.coder-hub.com/76007568.html