How to print all duplicate keys, including full paths, and optionally with values, for nested JSON in Python?


Question


External libraries are allowed but less preferred.

Example input:

data.json with content:

{
    "name": "John",
    "age": 30,
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "street": "321 Wall St"
    },
    "contacts": [
        {
            "type": "email",
            "value": "john@example.com"
        },
        {
            "type": "phone",
            "value": "555-1234"
        },
        {
            "type": "email",
            "value": "johndoe@example.com"
        }
    ],
    "age": 35
}

Example expected output:

Duplicate keys found:
  age (30, 35)
  address -> street ("123 Main St", "321 Wall St")

Used as is, json.load/json.loads returns a standard Python dictionary, which silently drops duplicate keys (the last value wins), so I think we need a way to "stream" the JSON as it loads, in some depth-first-search / visitor-pattern way.
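To make the behaviour described above concrete, here is a minimal sketch using a shortened input (the two-key string is made up for illustration). It shows that the default load keeps only the last value, while object_pairs_hook is handed every pair before anything is discarded.

```python
import json

text = '{"age": 30, "age": 35}'

# Default behaviour: the later value silently replaces the earlier one.
plain = json.loads(text)
print(plain)   # {'age': 35}

# object_pairs_hook receives every (key, value) pair of an object, in
# document order. Returning them as a list keeps both occurrences.
pairs = json.loads(text, object_pairs_hook=list)
print(pairs)   # [('age', 30), ('age', 35)]
```

So no separate streaming parser is strictly needed to see the duplicates; the hook already exposes them, one JSON object at a time.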

I've also tried something similar to what was suggested here: https://stackoverflow.com/a/14902564/8878330 (quoted below)

def dict_raise_on_duplicates(ordered_pairs):
    """Reject duplicate keys."""
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            raise ValueError("duplicate key: %r" % (k,))
        else:
            d[k] = v
    return d

The only change I made was that, instead of raising, I append the duplicate key to a list so I can print the list of duplicate keys at the end.
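A rough sketch of that modification might look like the following. The factory function make_duplicate_collector and the shortened input are assumptions made for illustration, not the exact code that was tried.

```python
import json


def make_duplicate_collector():
    """Return a hook plus the list it fills with duplicate key names."""
    duplicates = []

    def hook(ordered_pairs):
        d = {}
        for k, v in ordered_pairs:
            if k in d:
                # Record the key instead of raising.
                duplicates.append(k)
            d[k] = v
        return d

    return hook, duplicates


hook, duplicates = make_duplicate_collector()
text = '{"age": 30, "address": {"street": "a", "street": "b"}, "age": 35}'
json.loads(text, object_pairs_hook=hook)
print(duplicates)   # ['street', 'age']
```

The hook runs for the innermost object first and only ever sees one object's pairs, so the bare key names are all it can record.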

The problem is that I don't see a simple way to get the "full path" of the duplicate keys.

Answer 1

Score: 0


We use the object_pairs_hook argument of json.loads to inspect all key/value pairs within the same JSON object and check for duplicate keys. When a duplicate key is found, we rename it by prepending '#duplicate_key#' (we assume that no original key begins with those characters). Next, we recursively walk the object that was parsed from the JSON, computing the full path of each dictionary key, and print the path and value of each duplicate we discovered.

import json

DUPLICATE_MARKER = '#duplicate_key#'
DUPLICATE_MARKER_LENGTH = len(DUPLICATE_MARKER)

s = """{
    "name": "John",
    "age": 30,
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "street": "321 Wall St"
    },
    "contacts": [
        {
            "type": "email",
            "value": "john@example.com"
        },
        {
            "type": "phone",
            "value": "555-1234"
        },
        {
            "type": "email",
            "value": "johndoe@example.com"
        }
    ],
    "age": 35
}"""

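def my_hook(initial_pairs):
    # Called once per JSON object with its (key, value) pairs in document order.
    seen = set()
    pairs = []
    for k, v in initial_pairs:
        if k in seen:
            # Repeated key: rename it so that dict() below does not discard it.
            # Note: a third occurrence maps to the same marked name, so only
            # the last repeated value of a key survives.
            k = DUPLICATE_MARKER + k
        else:
            seen.add(k)
        pairs.append((k, v))
    return dict(pairs)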

def get_duplicates_path(o, path):
    # Walk the parsed object, building a Python-style access path as we go.
    if isinstance(o, list):
        for i, v in enumerate(o):
            get_duplicates_path(v, f'{path}[{i}]')
    elif isinstance(o, dict):
        for k, v in o.items():
            if k.startswith(DUPLICATE_MARKER):
                k = k[DUPLICATE_MARKER_LENGTH:]
                print(f'duplicate key at {path}[{k!r}] with value {v!r}')
            # Recurse in both cases, so that duplicates nested inside a
            # duplicate key's value are reported as well.
            get_duplicates_path(v, f'{path}[{k!r}]')

print(s)
obj = json.loads(s, object_pairs_hook=my_hook)
get_duplicates_path(obj, 'obj')

print()

# Another test:

s = """[
   {
       "x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
   },
   {
       "y": "z"
   }
]"""

print(s)
obj = json.loads(s, object_pairs_hook=my_hook)
get_duplicates_path(obj, 'obj')

Prints:

{
    "name": "John",
    "age": 30,
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "street": "321 Wall St"
    },
    "contacts": [
        {
            "type": "email",
            "value": "john@example.com"
        },
        {
            "type": "phone",
            "value": "555-1234"
        },
        {
            "type": "email",
            "value": "johndoe@example.com"
        }
    ],
    "age": 35
}
duplicate key at obj['address']['street'] with value '321 Wall St'
duplicate key at obj['age'] with value 35

[
   {
       "x": [{"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "a": 3}]
   },
   {
       "y": "z"
   }
]
duplicate key at obj[0]['x'][1]['a'] with value 3
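The marker-based hook reports only the later value of each duplicate and prints Python-style paths. As a sketch of one way to get closer to the output format asked for in the question, the hook can instead return a dict subclass that remembers every value seen per key, with paths built in a second pass. The names TrackedDict, find_duplicates and report are made up for this illustration and are not part of the json module.

```python
import json


class TrackedDict(dict):
    """A dict that also remembers every value seen for each key."""

    def __init__(self, pairs):
        super().__init__()
        self.all_values = {}  # key -> list of every value, in document order
        for key, value in pairs:
            self.all_values.setdefault(key, []).append(value)
            self[key] = value  # last value wins, as with plain json.loads


def find_duplicates(node, path=()):
    """Yield (path, values) for every key that occurs more than once."""
    if isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_duplicates(item, path + (f"[{index}]",))
    elif isinstance(node, TrackedDict):
        for key, values in node.all_values.items():
            key_path = path + (key,)
            if len(values) > 1:
                yield key_path, values
            # Descend into every value, including ones that were overwritten.
            for value in values:
                yield from find_duplicates(value, key_path)


def report(text):
    """Return one formatted line per duplicate key found in the JSON text."""
    root = json.loads(text, object_pairs_hook=TrackedDict)
    lines = []
    for path, values in find_duplicates(root):
        shown = ", ".join(json.dumps(v) for v in values)
        lines.append(f"{' -> '.join(path)} ({shown})")
    return lines


sample = """{
    "age": 30,
    "address": {"street": "123 Main St", "street": "321 Wall St"},
    "age": 35
}"""
for line in report(sample):
    print(line)
# age (30, 35)
# address -> street ("123 Main St", "321 Wall St")
```

Because every value is kept in a list, a key repeated three or more times is reported with all of its values, and no reserved marker prefix is needed. List indices appear in the path as [0], [1], and so on.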

huangapple
  • Posted on June 8, 2023, 18:24:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76430893.html