Converting JSON records to Parquet using Python
# Question
I am trying to convert JSON input records to Parquet format and return them as output. Below is a sample of the JSON records I receive as input.
**Input record:**
```json
{"id": "37547594730892523208777", "timestamp": 1518747, "message": "10-05-2023 04:21:58.092 [pool-2987-thread-1] INFO com.github.vjhdgk.loggenerator.SellRequest - id=32802,ip=188.219.135.214, email=cbhdg3@gmail.com,sex=F,brand=redjh,name=imac Touch,color=cert,options=Disk 32Go,price=329.0"}
```
I am using the following code in a Lambda function to convert the JSON logs above to Parquet format and return them.

**index.py**
```python
import base64
import gzip
import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import boto3


def lambda_handler(event, context):
    print('Received event: %s', event)
    output = []
    json_string = ''
    print(event['records'])
    for record in event['records']:
        print(record['data'])
        data = json.loads(gzip.decompress(
            base64.b64decode(record['data'])))
        print(data['logEvents'])
        for logEvents in data['logEvents']:
            print(logEvents)

        df = pd.DataFrame(data=processed_messages)
        print("df", df.head())
        # Convert the Pandas dataframe to an Arrow table
        table = pa.Table.from_pandas(df)
        # Write the Arrow table to a Parquet file in memory
        parquet_bytes = pa.BufferOutputStream()
        pq.write_table(table, parquet_bytes)
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(json_string.encode('utf-8')).decode('utf-8')
        }
        output.append(output_record)
    print(json.dumps(output))
    return {'records': output}
```
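Can anyone help me modify this code so that it converts the JSON records above to Parquet?
I don't want to write the converted Parquet records to a file; I just want to return them as output. Thanks in advance.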
# Answer 1
**Score**: 1
Since the data is already in a DataFrame, I think `df.to_parquet(...)` should work.