JSON malformed error for Batch Inference Job Input - Amazon Personalize

Question

I have created a solution version using the "similar-items" recipe in Amazon Personalize and am trying to test it with a batch inference job. I followed the AWS documentation, which states that the input should be a list of itemIds, with a maximum of 500 items, and each itemId separated by a new line:

{"itemId": "105"}
{"itemId": "106"}
{"itemId": "441"}
...
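(Editorial aside: the format above is plain JSON Lines, one JSON object per line. A minimal sketch that produces it directly with the standard json module, using the sample IDs from the documentation excerpt above:)

```python
import json

# Sample item IDs from the documentation excerpt above
item_ids = ["105", "106", "441"]

# JSON Lines: one JSON object per line, newline-separated
with open("job_input.json", "w") as f:
    for item_id in item_ids:
        f.write(json.dumps({"itemId": item_id}) + "\n")
```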

Accordingly, I wrote the following code to transform my item_ids column into the described JSON format:

# convert item_id column to required JSON format with new lines entered between items
items_json = items_df['ITEM_ID'][1:200].to_json(orient='columns').replace(',', '}\n{')

# write output to JSON file
with open('items_json.json', 'w') as f:
    json.dump(items_json, f)

# write file to S3
from io import StringIO  
import s3fs

# Connect to S3 default profile
s3 = boto3.client('s3')

s3.put_object(
     Body=json.dumps(items_json),
     Bucket='bucket',
     Key='personalize/batch-recommendations-input/items_json.json'
)

Then when I run the batch inference job with that as input, it gives the following error: "User error: Input JSON is malformed."

My sample JSON input looks as follows:

    "{\"itemId\":\"12637\"} {\"itemId\":\"12931\"} {\"itemId\":\"13005\"}"

and after copying it to S3 as follows (adding backslashes to it)- don't know if that's significant in any way:

    "{\"itemId\":\"12637\"}\n{\"itemId\":\"12931\"}\n{\"itemId\":\"13005\"}"

To me, my format looks quite similar to what they asked for; any clue what might be causing the error?
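(Editorial aside: the backslashes in the second sample are a tell. Calling json.dump or json.dumps on a string that already contains serialized JSON wraps it in an outer JSON string literal, escaping the inner quotes and newlines. A quick check with the IDs from the samples above:)

```python
import json

# A string that already holds two newline-separated JSON objects
doc = '{"itemId":"12637"}\n{"itemId":"12931"}'

# Serializing the *string* again produces a quoted, escaped literal
encoded = json.dumps(doc)
print(encoded)
# → "{\"itemId\":\"12637\"}\n{\"itemId\":\"12931\"}"
```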


Answer 1

Score: 1

You just need some small changes to the use of to_json. Specifically, orient should be records and lines should be True.

Full example:

import pandas as pd
import boto3

items_df = pd.read_csv("...")

# Make sure the item ID column name is "itemId"
item_ids_df = items_df.rename(columns={"ITEM_ID": "itemId"})[["itemId"]]

# Write the DataFrame to a file in JSON Lines format
item_ids_df.to_json("job_input.json", orient="records", lines=True)

# Upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object("job_input.json").upload_file("job_input.json")

Lastly, you mentioned that the maximum number of input items is 500. Actually, your input file can have up to 50M input items or a file size of 1GB.
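Once the JSON Lines file is in S3, the job itself is created with the Personalize create_batch_inference_job API. A sketch of the request follows; all ARNs, the bucket name, and the S3 paths are placeholders, not values from this thread:

```python
# Placeholder ARNs and S3 paths; substitute your own resources
params = {
    "jobName": "similar-items-batch-job",
    "solutionVersionArn": "arn:aws:personalize:us-east-1:111122223333:solution/my-solution/version-id",
    "roleArn": "arn:aws:iam::111122223333:role/PersonalizeS3Role",
    "jobInput": {"s3DataSource": {"path": "s3://bucket/personalize/batch-recommendations-input/job_input.json"}},
    "jobOutput": {"s3DataDestination": {"path": "s3://bucket/personalize/batch-recommendations-output/"}},
}

# The actual call requires AWS credentials, so it is shown but not executed:
# import boto3
# personalize = boto3.client("personalize")
# response = personalize.create_batch_inference_job(**params)
```

Note that jobOutput must point to an S3 folder (trailing slash), and the role must be able to read the input and write the output.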


huangapple
  • Posted on May 29, 2023 at 19:10:56
  • When reposting, please keep this article's link: https://go.coder-hub.com/76356828.html