Question


I have a file full of JSON objects that I want to read in. Unfortunately, the format is a bit non-standard: some messages are base64-encoded and some aren't. The messages that aren't encoded contain another JSON object, which unfortunately includes newlines as well. This breaks reading the file "the standard way" (spark.read.json("my_file.json")).

The file looks like this:

{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
    "more":"Info",\n
    "but":"already",\n
    "decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n

With a format like this, I can't use the newline as a line separator.

Is there a good way to cut the file into records based on opening and closing curly braces? Or how would I write my own parser for that format?

Answer 1

Score: 1


One approach is to use ast.literal_eval.

Combined with a try/except block, it exploits the fact that a record parses successfully only once all of its lines have been joined together.

The following is a brief implementation of that idea.

Note: because the function uses yield, it is a generator and produces records lazily, one at a time, instead of building the whole list up front.

string = """{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}\n
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[\n
{\n
    "more":"Info",\n
    "but":"already",\n
    "decoded":"!"\n
}\n
]\n
}\n
{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"XxeNiceBodymessageinBase64again"}\n"""


from ast import literal_eval


def yield_correct_structs(string):
    local_builder = ""  # accumulates the lines of a record that spans several lines
    for s in string.splitlines():
        if not s.strip():
            continue  # skip blank lines (the "\n" escapes above produce some)
        local_builder += s
        try:
            x = literal_eval(local_builder)
        except (ValueError, SyntaxError):
            continue  # record is still incomplete; keep accumulating
        local_builder = ""  # reset so a finished record is not yielded twice
        yield x


for e in yield_correct_structs(string):
    print(type(e), e)

Output:

<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'WWBxssBase64gibberish'}
<class 'dict'> {'Timestamp': '2022-05-14T00:29:14.4700000Z', 'Properties': {'connectionDeviceId': 'ID2'}, 'Body': [{'more': 'Info', 'but': 'already', 'decoded': '!'}]}
<class 'dict'> {'Timestamp': '2022-05-14T00:28:00.2440000Z', 'Properties': {'connectionDeviceId': 'ID1'}, 'Body': 'XxeNiceBodymessageinBase64again'}
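As a possible alternative to ast.literal_eval, the standard library's json.JSONDecoder.raw_decode can parse one JSON value from a given position and report where it ended, which allows splitting on record boundaries instead of on newlines. The sketch below is not part of the original answer; the names iter_json_objects, to_json_lines and sample are made up for illustration, and it assumes the whole file fits in memory as a single string.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value in a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_json_lines(text):
    """Re-serialize the records as one compact JSON object per line."""
    return "\n".join(json.dumps(obj) for obj in iter_json_objects(text))


# Hypothetical sample in the same shape as the file from the question.
sample = """{"Timestamp":"2022-05-14T00:28:00.2440000Z","Properties":{"connectionDeviceId":"ID1"},"Body":"WWBxssBase64gibberish"}
{"Timestamp":"2022-05-14T00:29:14.4700000Z","Properties":{"connectionDeviceId":"ID2"},"Body":[
{
    "more":"Info",
    "but":"already",
    "decoded":"!"
}
]
}
"""

records = list(iter_json_objects(sample))
print(len(records))
print(to_json_lines(sample))
```

Unlike literal_eval, this also accepts JSON literals such as true, false and null. The one-object-per-line output is the layout spark.read.json expects by default, so writing it to a file is one way to get back to "the standard way"; the mixed type of Body (string versus array) would still need separate handling.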


