Python - Convert and filter structured text into object

Question

Current problem:
I'm working with a set of data files that essentially look like this:

{39107,
    {31685,
        {	f24c4ec6-1e59-47a0-9736-8c823eda0d28,
            "N",
            7
        },
        {	c71dce36-4295-49e4-be03-7c60969b96c3,
            "A",
            8
        },
        {	f80fce14-f001-4b20-84d5-7a00f0788f6b,
            "A",
            9
        },
    }
}

And

{0,
    {4659,
        {
            7c90ea6a-12f5-4c54-bfe0-e38120a6e364,
            "fieldname27472",
            "N",
            27472,
            "",
            {3,
                {"field1",
                    0,
                    {1,
                        {
                            "B",
                            16,
                            0,
                            "",
                            0
                        }
                    },
                    "",
                    0
                },
                {"field2",
                    0,
                    {1,
                        {
                            "T",
                            0,
                            0,
                            "",
                            0}
                    },
                    "",
                    0
                },
                {"field3",
                    0,
                    {1,
                        {
                            "L",
                            0,
                            0,
                            "",
                            0
                        }
                    },
                    "",
                    0
                },
            },
            {0},
            {1,
                {
                    edcba,
                    "ByID",
                    abcde,
                    1,
                    {1,
                        "ID"
                    },
                    1,
                    0,
                    0
                }
            },
            1,
            "S",
            {0},
            {0},
            "",
            0,
            0
        }
    }
}

A number preceding a set of data (e.g. 4659) is the count of data containers that follow.
Some values are not enclosed in quotes, such as the UUIDs in this example, or arbitrary strings.

My goal is to convert these data structures into Python objects, such as lists or tuples, and then convert them to JSON for external processing.

Right now I have a two-stage process.
Stage 1 does the initial conversion and data evaluation.
Stage 2 filters the data, removing redundant values (such as the element count that precedes the actual elements) and unnecessary list nesting.

import json

file = 'stack1.json'

def stage1(msg):
    buffer = ''
    st,fh,delim,encase = '[',']',',', '"'
    msg = msg.translate(str.maketrans('{}',st+fh)).replace('\n', '').replace('\r', '').replace('\t', '')
    while True:
        fhpos = msg.find(fh)
        if fhpos >= 0:
            head = msg[:fhpos+1]
            if head:
                stpos = head.rfind(st)
                if stpos>=0:
                    teststring = head[stpos+1:fhpos].split(delim)
                    for idx,sent in enumerate(teststring):
                        if not (sent.startswith(encase) or sent.endswith(encase)) or sent.count('-') == 4:
                            teststring[idx] = (f'"{teststring[idx]}"')
                            break
                    buffer+= head[:stpos+1]+','.join(teststring)+fh
                else: buffer+=fh
            msg = msg[fhpos+1:]
        else:
            break
    return buffer

def stage2(lst):
    if not any([isinstance(i,list) for i in lst]):
        return tuple(lst)
    if not isinstance(lst[0],list) and all([isinstance(j,list) for j in lst[1:]]):
        lst = stage2(lst[1:])
        if all([isinstance(j,(list,tuple)) for j in lst]) and len(lst) == 1:
            lst, = lst
    for idx,i in enumerate(lst):
        if isinstance(i,list):
            lst[idx] = stage2(i)
        else:
            continue
    return stage2(lst)

with open(file, 'r') as f:
    data = f.read()
    try:
        s1 = stage1(data)
        print("STAGE1\n",s1)
        s2 = stage2(json.loads(s1))
        print("STAGE2\n",json.dumps(s2, indent=2))
    except Exception as e: print(e)

Current results:

Example 1:

STAGE1
[39107,[31685,["f24c4ec6-1e59-47a0-9736-8c823eda0d28","N",7],["c71dce36-4295-49e4-be03-7c60969b96c3","A",8],["f80fce14-f001-4b20-84d5-7a00f0788f6b","A",9]]]
STAGE2
[
  [
    "f24c4ec6-1e59-47a0-9736-8c823eda0d28",
    "N",
    7
  ],
  [
    "c71dce36-4295-49e4-be03-7c60969b96c3",
    "A",
    8
  ],
  [
    "f80fce14-f001-4b20-84d5-7a00f0788f6b",
    "A",
    9
  ]
]

Example 2:

STAGE1
[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N",27472,"",[3,[aa-aa-a-a-a,"field1",0,[1,["B","16",0,"",0]]],["field2",0,[1,["T","0",0,"",0]]],["field3",0,[1,["L","0",0,"",0]]]],["0"],[1,[edcba,"ByID",abcde,1,["1","ID"]]],1,"S",["0"],["0"]]]]
STAGE2
Expecting ',' delimiter: line 1 column 12 (char 11)

Example 2 fails because not all values get quoted.

Which libraries might be suitable for this case?
The datasets are rather large: the first example is currently ~5M characters, and stage 1 takes up to 1 minute to process it.

Future problem:
What are the best approaches for converting and filtering data like this?
I think converting and filtering in a single pass would be faster than performing several full scans.
I've read about PLY and PEG, but I don't think these are the right tools for the job.

Answer 1

Score: 1

> My goal is to convert these data structures into Python objects, such as lists or tuples, and then convert them to JSON for external processing.

I would actually first convert the string to valid JSON, then turn that into a Python data structure with json.loads, after which you can use standard iteration to filter and map as desired.

If the examples are fully representative, then there are essentially three "problems" to resolve to make the text JSON-compliant:

  • Curly braces that function as array boundaries should be replaced with square brackets.
  • Trailing commas (after the last array element) should be removed.
  • Hexadecimal values, possibly including hyphens, should be quoted. (Alternatively they could be encoded with a 0x prefix, but then longer digit sequences would have to be broken up into parts, so I won't go that way.)

I'll also assume that:

  • opening curly braces that act as array boundaries always appear at the start of a line (ignoring whitespace);
  • closing curly braces that act as array boundaries, possibly immediately followed by a trailing comma, always appear at the end of a line;
  • unquoted hexadecimal values appear at the start of a line (ignoring whitespace), except that a single opening brace may precede the value.

If all these assumptions are correct, then the following should work:
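As a rough illustration of that last filtering step, here is a minimal sketch that works on the already-parsed lists. It assumes, as stage2 in the question does, that an integer followed only by lists is a count prefix, and that a list wrapping a single list can be flattened. strip_counts is a hypothetical helper name, not part of any library.

```python
def strip_counts(node):
    """Recursively drop count prefixes and unwrap single-list wrappers.

    Assumption (mirrors the question's stage2 heuristics): an integer
    followed only by lists is a count, not data.
    """
    if not isinstance(node, list):
        return node
    items = [strip_counts(child) for child in node]
    # Drop a leading integer when everything after it is a list.
    if (len(items) > 1 and isinstance(items[0], int)
            and all(isinstance(c, list) for c in items[1:])):
        items = items[1:]
    # Flatten a list whose only element is itself a list.
    if len(items) == 1 and isinstance(items[0], list):
        items = items[0]
    return items


parsed = [39107, [31685, ["u1", "N", 7], ["u2", "A", 8]]]
print(strip_counts(parsed))  # [['u1', 'N', 7], ['u2', 'A', 8]]
```

Both rules are heuristics: a record that legitimately consists of an integer followed only by lists would lose its first value, so the rule may need tightening for real data.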

import re
import json

def process(s):
    # replace braces with square brackets
    s = re.sub(r"^(\s*){\n?", r"\1[\n", s, flags=re.M)
    s = re.sub(r"}(,?)$", r"]\1", s, flags=re.M)
    # remove trailing commas (not valid in JSON)
    s = re.sub(r",$(\s+])", r"\1", s, flags=re.M)
    # wrap hex in quotes
    s = re.sub(r'^(\s*)(?=.*[\-a-z])([\w\-]+)', r'\1"\2"', s, flags=re.M)
    return json.loads(s)

with open("stack.json", 'r') as f:
    data = process(f.read())
    print(data)
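
If the line-position assumptions above turn out not to hold, one alternative is to tokenize the text and build the nested lists in a single pass, without producing intermediate JSON text. The following is only a sketch of that different technique (it is not the regex approach above). It assumes that quoted strings never contain an embedded double quote and that braces are balanced. TOKEN and parse are hypothetical names.

```python
import json
import re

# One token is a brace, a double-quoted string, or a bare value.
# Commas and whitespace match nothing and are skipped by finditer.
# Assumption: quoted strings contain no embedded double quotes.
TOKEN = re.compile(r'([{}])|"([^"]*)"|([^\s,{}"]+)')


def parse(text):
    """Return the list of top-level structures found in text."""
    stack = [[]]  # stack[0] collects the top-level structures
    for match in TOKEN.finditer(text):
        brace, quoted, bare = match.groups()
        if brace == "{":
            new_list = []
            stack[-1].append(new_list)
            stack.append(new_list)
        elif brace == "}":
            stack.pop()  # assumes braces are balanced
        elif quoted is not None:
            stack[-1].append(quoted)
        elif re.fullmatch(r"-?\d+", bare):
            stack[-1].append(int(bare))
        else:
            stack[-1].append(bare)  # unquoted UUIDs and identifiers
    return stack[0]


sample = '{2,\n    {\tab-cd,\n        "N",\n        7\n    },\n    {0},\n}'
result = parse(sample)
print(result)              # [[2, ['ab-cd', 'N', 7], [0]]]
print(json.dumps(result))  # [[2, ["ab-cd", "N", 7], [0]]]
```

Because each character is visited once and the nesting is tracked on a stack, the dropping of count prefixes could be folded into the "}" branch if a single pass over the data is the goal.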

huangapple
  • Posted on July 3, 2023, 20:18:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76604676.html