Read in non-standard JSON format with pyspark

Question

I have a file filled with JSON objects that I want to read in. Unfortunately, the format is a bit non-standard: some messages are base64-encoded and some aren't. The messages that aren't encoded contain another JSON object, which unfortunately spans multiple lines. This breaks reading the file "the standard way" (spark.read.json("my_file.json")).

The file looks like this:

{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
    "more":"Info",\n
    "but":"already",\n
    "decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n

With a format like this, I can't use the newline as a line separator.

Is there a good way to cut the file into lines based on opened and closed curly braces? Or how would I write my own parser for that format?

Answer 1

Score: 1


One approach I can find is to use ast.literal_eval.

Combined with a try/except block, this exploits the fact that a record only parses successfully once all of its lines have been joined together.

The following is a brief implementation of that idea.

Note: Using yield makes the function a generator, so records are produced one at a time instead of being collected into a list first.
string = """{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
    "more":"Info",\n
    "but":"already",\n
    "decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n"""


from ast import literal_eval


def yield_correct_structs(string):
    # Accumulate lines until the buffer parses as one complete literal.
    local_builder = ""
    for s in string.splitlines():
        local_builder = local_builder + s
        if not local_builder.strip():
            continue  # blank line between records, nothing to parse yet
        try:
            x = literal_eval(local_builder)
        except (ValueError, SyntaxError):
            continue  # record is still incomplete, keep accumulating
        yield x
        local_builder = ""  # reset, otherwise the same record is yielded again


for e in yield_correct_structs(string):
    print(type(e), e)

Output:

<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'WWBxssBase64gibberish'}
<class 'dict'> {'Timestamp': '2022-05-14T00:29:14.4700000Z', 'Properties': {'connectionDeviceId': 'ID2'}, 'Body': [{'more': 'Info', 'but': 'already', 'decoded': '!'}]}
<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'XxeNiceBodymessageinBase64again'}
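One caveat with ast.literal_eval is that it parses Python literals, not JSON, so records containing true, false or null would fail to parse. As an alternative sketch (a different technique from the one above, not part of the original answer), the standard library's json.JSONDecoder.raw_decode can pull one JSON value at a time out of a string of concatenated documents, regardless of where the newlines fall. The names iter_json_objects and sample are made up for this illustration.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do that by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past its last character.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample mirroring the file layout shown in the question.
sample = (
    '{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n'
    '{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n'
    '{\n'
    '    "more":"Info",\n'
    '    "but":"already",\n'
    '    "decoded":"!"\n'
    '}\n'
    ']\n'
    '}\n'
    '{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n'
)

records = list(iter_json_objects(sample))
for record in records:
    print(type(record["Body"]).__name__, record["Properties"]["connectionDeviceId"])
```

To use something like this from Spark, one option (untested here, and assuming each file fits in a single executor's memory) would be to load each file as one string with sparkContext.wholeTextFiles and flatMap the generator over the file contents.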

huangapple
  • Posted on March 7, 2023, 19:51:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75661634.html