Read in non-standard JSON format with pyspark
Question
I have a file full of JSON objects that I want to read in. Unfortunately, the format is a bit non-standard: some messages are base64-encoded and some aren't. The messages that aren't encoded contain another JSON object, which unfortunately spans multiple lines. This breaks reading the file the standard way (spark.read.json("my_file.json")).
The file looks like this:
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
"more":"Info",\n
"but":"already",\n
"decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n
With a format like this, I can't use the newline as a line separator.
Is there a good way to split the file into records based on opening and closing curly braces? Or how would I write my own parser for this format?
Answer 1
Score: 1
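The approach I can suggest is to use ast.literal_eval.
Combined with a try/except block, it exploits the fact that a record only parses successfully once all of its lines have been joined: lines are accumulated until the accumulated text evaluates to a valid structure.
The following is a brief implementation of that idea. Keep in mind that literal_eval parses Python literals rather than JSON, so it works for this sample but would reject JSON values such as true, false, or null; json.loads is the safer choice if those can occur.
Note: Using yield makes the function a generator, so records are produced one at a time instead of being collected into a list first.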
string="""{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
"more":"Info",\n
"but":"already",\n
"decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n"""
from ast import literal_eval

def yield_correct_structs(string):
    local_builder = ""
    for s in string.splitlines():
        if not s.strip():
            continue  # blank lines carry no data
        # Keep appending lines until the buffer parses as one complete structure.
        local_builder += s
        try:
            x = literal_eval(local_builder)
        except (SyntaxError, ValueError):
            continue  # structure is still incomplete, read the next line
        local_builder = ""  # reset, otherwise the same record is yielded twice
        yield x

for e in yield_correct_structs(string):
    print(type(e), e)
Output:
<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'WWBxssBase64gibberish'}
<class 'dict'> {'Timestamp': '2022-05-14T00:29:14.4700000Z', 'Properties': {'connectionDeviceId': 'ID2'}, 'Body': [{'more': 'Info', 'but': 'already', 'decoded': '!'}]}
<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'XxeNiceBodymessageinBase64again'}
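An alternative to joining lines by trial and error is to let the JSON parser itself report where each object ends. The sketch below uses json.JSONDecoder.raw_decode from the standard library, which parses one value starting at a given index and returns the index just past it. The names iter_json_objects and read_with_spark are made up for illustration. read_with_spark is an untested assumption about how the records could be handed to Spark: it expects an existing SparkSession and assumes each file fits in memory on a single executor, because wholeTextFiles reads every file as one string.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over the
        # newlines and spaces that separate two records.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse exactly one value and move `pos` just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def read_with_spark(spark, path):
    """Hypothetical helper: `spark` is a SparkSession, `path` a file or directory."""
    # wholeTextFiles gives (file_name, file_content) pairs, one per file.
    records = spark.sparkContext.wholeTextFiles(path).flatMap(
        lambda name_and_text: iter_json_objects(name_and_text[1])
    )
    # Re-serialize every record as single-line JSON; spark.read.json
    # accepts an RDD of such strings.
    return spark.read.json(records.map(json.dumps))


if __name__ == "__main__":
    sample = (
        '{"Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n'
        '{"Properties":{"connectionDeviceId":"ID2"},"Body":[\n'
        '{\n"more":"Info",\n"but":"already",\n"decoded":"!"\n}\n]\n}\n'
    )
    for record in iter_json_objects(sample):
        print(record["Properties"]["connectionDeviceId"], type(record["Body"]).__name__)
```

Because the decoder tracks braces, brackets, and string quoting itself, curly braces inside string values do not confuse the splitting, and malformed input raises json.JSONDecodeError instead of being silently skipped.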