July 3, 2023 20:18:18

Python - Convert and filter structured text into object

Question

Current problem:
I'm working with a set of data files that essentially look like this:

{39107,
    {31685,
        {	f24c4ec6-1e59-47a0-9736-8c823eda0d28,
            "N",
            7
        },
        {	c71dce36-4295-49e4-be03-7c60969b96c3,
            "A",
            8
        },
        {	f80fce14-f001-4b20-84d5-7a00f0788f6b,
            "A",
            9
        },
    }
}

And

{0,
    {4659,
        {
                        7c90ea6a-12f5-4c54-bfe0-e38120a6e364,
                        "fieldname27472",
                        "N",
                        27472,
                        "",
                        {3,
                                {"field1",
                                        0,
                                        {1,
                                                {
                                                        "B",
                                                        16,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                                {"field2",
                                        0,
                                        {1,
                                                {
                                                "T",
                                                0,
                                                0,
                                                "",
                                                0}
                                        },
                                        "",
                                        0
                                },
                                {"field3",
                                        0,
                                        {1,
                                                {
                                                        "L",
                                                        0,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                        },
                        {0},
                        {1,
                                {
                                        edcba,
                                        "ByID",
                                        abcde,
                                        1,
                                        {1,
                                                "ID"
                                        },
                                        1,
                                        0,
                                        0
                                }
                        },
                        1,
                        "S",
                        {0},
                        {0},
                        "",
                        0,
                        0
                }
        }
}

The number at the start of each set (e.g. 4659) gives the number of data containers that follow.
Some values are not enclosed in quotes, such as the UUIDs in this example or arbitrary strings.

My goal is to convert these data structures into Python objects, such as lists or tuples, and then convert them to JSON for external processing.

Right now I have a two-stage process.
Stage 1 does the initial conversion and data evaluation.
Stage 2 filters the data, removing redundant values (such as the element count that precedes the actual elements) and unnecessary nested lists.

import json
file = 'stack1.json'
def stage1(msg):
    buffer = ''
    st,fh,delim,encase = '[',']',',', '"'
    msg = msg.translate(str.maketrans('{}',st+fh)).replace('\n', '').replace('\r', '').replace('\t', '')
    while True:
        fhpos = msg.find(fh)
        if fhpos >= 0:
            head = msg[:fhpos+1]
            if head:
                stpos = head.rfind(st)
                if stpos >= 0:
                    teststring = head[stpos+1:fhpos].split(delim)
                    for idx,sent in enumerate(teststring):
                        if not (sent.startswith(encase) or sent.endswith(encase)) or sent.count('-') == 4:
                            teststring[idx] = (f'"{teststring[idx]}"')
                            break
                    buffer+= head[:stpos+1]+','.join(teststring)+fh
                else: buffer+=fh
            msg = msg[fhpos+1:]
        else:
            break
    return buffer
def stage2(lst):
    if not any([isinstance(i,list) for i in lst]):
        return tuple(lst)
    if not isinstance(lst[0],list) and all([isinstance(j,list) for j in lst[1:]]):
        lst = stage2(lst[1:])
        if all([isinstance(j,(list,tuple)) for j in lst]) and len(lst) == 1:
            lst, = lst
    for idx,i in enumerate(lst):
        if isinstance(i,list):
            lst[idx] = stage2(i)
        else:
            continue
    return stage2(lst)
with open(file, 'r') as f:
    data = f.read()
    try:
        s1 = stage1(data)
        print("STAGE1\n",s1)
        s2 = stage2(json.loads(s1))
        print("STAGE2\n",json.dumps(s2, indent=2))
    except Exception as e: print(e)

Current results:

Example 1:

STAGE1
[39107,[31685,["f24c4ec6-1e59-47a0-9736-8c823eda0d28","N",7],["c71dce36-4295-49e4-be03-7c60969b96c3","A",8],["f80fce14-f001-4b20-84d5-7a00f0788f6b","A",9]]]
STAGE2
 [
  [
    "f24c4ec6-1e59-47a0-9736-8c823eda0d28",
    "N",
    7
  ],
  [
    "c71dce36-4295-49e4-be03-7c60969b96c3",
    "A",
    8
  ],
  [
    "f80fce14-f001-4b20-84d5-7a00f0788f6b",
    "A",
    9
  ]
]

Example 2:

STAGE1
[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N",27472,"",[3,[aa-aa-a-a-a,"field1",0,[1,["B","16",0,"",0]]],["field2",0,[1,["T","0",0,"",0]]],["field3",0,[1,["L","0",0,"",0]]]],["0"],[1,[edcba,"ByID",abcde,1,["1","ID"]]],1,"S",["0"],["0"]]]]
STAGE2
Expecting ',' delimiter: line 1 column 12 (char 11)

Example 2 fails because not every unquoted value gets wrapped in quotes.

Which libraries might be suitable for this case?
The datasets are rather large: the first example is currently ~5M characters, and stage 1 takes up to a minute to process it.

Future problem:
What are the best approaches for converting and filtering data like this?
I think converting and filtering in a single pass would be faster than performing a full scan several times.
I've read about PLY and PEG, but I don't think these are the right tools for the job.

Answer 1

Score: 1

> My goal is to convert this data structures in python objects, like lists or tuples, then convert them to JSON for external processing.

I would actually first convert the string to valid JSON. Then turn that into a Python data structure with json.loads, after which you can use standard iteration to filter and map as desired.
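
As a rough illustration of that filtering step, here is a minimal sketch. It is not taken from the question's code: `strip_counts` is a hypothetical helper, and it assumes that a leading integer followed only by sub-lists is an element count that can be dropped. Like the question's stage 2, it also collapses a wrapper that ends up holding a single list.

```python
def strip_counts(node):
    """Recursively drop leading element counts from nested lists.

    Assumption (not guaranteed by the data format): a list of the form
    [int, list, list, ...] is a count followed by its containers.
    """
    if not isinstance(node, list):
        return node
    # Clean the children first so the check below sees filtered sub-lists.
    items = [strip_counts(child) for child in node]
    head, rest = items[:1], items[1:]
    is_count = bool(head) and type(head[0]) is int
    if is_count and rest and all(isinstance(x, list) for x in rest):
        # Collapse a single remaining container, mirroring the question's stage 2.
        return rest[0] if len(rest) == 1 else rest
    return items
```

A list such as `[0]` is left untouched here, because the format alone does not say whether it is an empty container or a record holding one number.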

If the examples are fully representative, then there are essentially 3 "problems" to resolve in order to make it JSON compliant:

1. Curly braces that function as array boundaries should be replaced with square brackets
2. Trailing commas (after the last array element) should be removed
3. Hexadecimal values, possibly including hyphens, should be quoted (alternatively they could be encoded with a 0x prefix, but then longer digit sequences have to be broken up into parts, so I'll not go that way).

I'll also assume that:

- opening curly braces that act as array boundaries will always appear at the start of a line (ignoring spacing)
- closing curly braces that act as array boundaries -- possibly accompanied by a trailing comma immediately following it -- will always appear at the end of a line.
- unquoted hexadecimal values will appear at the start of a line (ignoring spacing), with the exception of one opening brace, which can occur before a value.

If all these assumptions are correct, then the following should work:

import re
import json
def process(s):
    # replace braces with square brackets
    s = re.sub(r"^(\s*){\n?", r"\1[\n", s, flags=re.M)
    s = re.sub(r"}(,?)$", r"]\1", s, flags=re.M)
    # remove trailing commas (not valid in JSON)
    s = re.sub(r",$(\s+])", r"\1", s, flags=re.M)
    # wrap hex in quotes
    s = re.sub(r'^(\s*)(?=.*[\-a-z])([\w\-]+)', r'\1"\2"', s, flags=re.M)
    return json.loads(s)
with open("stack.json", 'r') as f:
    data = process(f.read())
    print(data)
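
If the layout assumptions above do not hold for some files, a different technique is to tokenize the text and build the nested lists directly in one pass. The following is only a sketch under my own assumptions: `TOKEN` and `parse` are hypothetical names, quoted strings are assumed to contain no embedded double quotes, and only plain integers are converted to numbers (every other bare token, such as a UUID, is kept as a string).

```python
import json
import re

# One token per match: punctuation, a quoted string, or a bare word.
TOKEN = re.compile(r'\s*(?:([{},])|"([^"]*)"|([^\s{},"]+))')

def parse(text):
    """Parse the brace-delimited format into nested Python lists."""
    stack = [[]]                      # stack[0] collects top-level values
    pos, end = 0, len(text.rstrip())
    while pos < end:
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        kind = m.lastindex            # which alternative matched
        if kind == 1:
            if m.group(1) == "{":     # open a new container
                new = []
                stack[-1].append(new)
                stack.append(new)
            elif m.group(1) == "}":   # close the current container
                if len(stack) == 1:
                    raise ValueError("unbalanced '}'")
                stack.pop()
            # commas only separate values, so trailing commas are tolerated
        elif kind == 2:
            stack[-1].append(m.group(2))
        else:
            word = m.group(3)
            is_int = re.fullmatch(r"-?\d+", word)
            stack[-1].append(int(word) if is_int else word)
    if len(stack) != 1:
        raise ValueError("unbalanced '{'")
    top = stack[0]
    return top[0] if len(top) == 1 else top

if __name__ == "__main__":
    sample = '{1,\n    {\tedcba,\n        "ByID",\n        7\n    },\n}'
    print(json.dumps(parse(sample)))
```

The result is already a JSON-serializable structure, so `json.dumps` can be applied to it directly, and any filtering can happen on the nested lists afterwards.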
