jq to recursively profile JSON object

Question

I have some huge JSON files I need to profile so I can transform them into some tables. I found jq to be really useful in inspecting them, but there are going to be hundreds of these, and I'm pretty new to jq.

I already have some really handy functions in my ~/.jq (big thank you to @mikehwang):

```jq
# Map each key of an object to the type of its value, with keys sorted.
def profile_object:
    to_entries | def parse_entry: {"key": .key, "value": .value | type}; map(parse_entry)
        | sort_by(.key) | from_entries;

# Profile every object in an array and merge the results into a single object.
def profile_array_objects:
    map(profile_object) | map(to_entries) | reduce .[] as $item ([]; . + $item) | sort_by(.key) | from_entries;
```

I'm sure I'll have to modify them after I describe my question.
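
To make the gap concrete, here is a small run of those two helpers on made-up data (the sample object and array are assumptions for illustration, and `parse_entry` is inlined for brevity). Nested values come back only as "array" or "object", which is the part that needs to become recursive.

```shell
# The two helpers from ~/.jq, inlined so the snippet is self-contained.
defs='
def profile_object:
  to_entries
  | map({key: .key, value: (.value | type)})
  | sort_by(.key) | from_entries;
def profile_array_objects:
  map(profile_object) | map(to_entries)
  | reduce .[] as $item ([]; . + $item)
  | sort_by(.key) | from_entries;
'

# A flat profile: the nested array is reported as "array", nothing deeper.
flat=$(printf '%s\n' '{"b": 1, "a": "x", "c": [1, 2]}' | jq -c "$defs profile_object")

# Keys from every object in the array are collected into one profile.
merged=$(printf '%s\n' '[{"a": 1}, {"b": "x"}]' | jq -c "$defs profile_array_objects")

echo "$flat"    # {"a":"string","b":"number","c":"array"}
echo "$merged"  # {"a":"number","b":"string"}
```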

I'd like a jq line to profile a single object. If a key maps to an array of objects, collect the unique keys across those objects and keep profiling downward through any nested arrays of objects. If a value is an object, profile that object.

Sorry for the long example, but imagine several GBs of this:

```json
{
    "name": "XYZ Company",
    "type": "Contractors",
    "reporting": [
        {
            "group_id": "660",
            "groups": [
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],
                    "market": {
                        "name": "Austin, TX",
                        "value": "873275"
                    }
                },
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],
                    "market": {
                        "name": "Nashville, TN",
                        "value": "2393287"
                    }
                }
            ]
        }
    ],
    "product_agreements": [
        {
            "negotiation_arrangement": "FFVII",
            "code": "84144",
            "type": "DJ",
            "type_version": "V10",
            "description": "DJ in a mask",
            "name": "Claptone",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        5,
                        458
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 17.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_modifier_code": [
                                "124"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        747
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 28.42,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        },
        {
            "negotiation_arrangement": "MGS3",
            "name": "David Byrne",
            "type": "Producer",
            "type_version": "V10",
            "code": "654321",
            "description": "Frontman from Talking Heads",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        9,
                        2344,
                        8456
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 68.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        679
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 89.25,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        }
    ],
    "version": "1.3.1",
    "last_updated_on": "2023-02-01"
}
```

Desired output:

```json
{
    "name": "string",
    "type": "string",
    "reporting": [
      {
        "group_id": "string",
        "groups": [
            {
                "ids": [
                    "number"
                ],
                "market": {
                    "name": "string",
                    "value": "string"
                }
            }
        ]
      }
    ],
    "product_agreements": [
      {
        "negotiation_arrangement": "string",
        "code": "string",
        "type": "string",
        "type_version": "string",
        "description": "string",
        "name": "string",
        "negotiated_rates": [
          {
            "company_references": [
                "number"
            ],
            "negotiated_prices": [
              {
                "type": "string",
                "rate": "number",
                "expiration_date": "string",
                "code": [
                  "string"
                ],
                "billing_modifier_code": [
                  "string"
                ],
                "billing_class": "string"
              }
            ]
          }
        ]
      }
    ],
    "version": "string",
    "last_updated_on": "string"
}
```

Really sorry if there are any errors in that, but I tried to make it all consistent and about as simple as I could.

To restate the need: recursively profile each key in a JSON object whenever its value is an object or an array. The solution needs to be independent of key names. Happy to clarify further if needed.

Answer 1

Score: 2

Given your input.json, here is a solution:

```shell
jq '
# Replace every scalar with its type name, recursing into objects and arrays.
# For arrays, duplicate profiles are dropped and object profiles are merged into one.
def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"  then map(schema)|unique
         | if (first | type) == "object" then [add] else . end
    else type
    end;
schema
' input.json
```
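
As a quick check of how the array branch behaves, here is the same `schema` function run on a tiny made-up document (the input is an assumption for illustration only). It shows objects inside an array being merged into a single profile, and repeated scalar types collapsing to one entry. The `-S` flag is used only to make the key order of the output predictable.

```shell
# Made-up input: an array of objects with different keys, and an array of strings.
doc='{"items": [{"a": 1}, {"b": "x"}], "tags": ["p", "q"]}'

result=$(printf '%s\n' "$doc" | jq -cS '
def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"  then map(schema) | unique
         | if (first | type) == "object" then [add] else . end
    else type
    end;
schema')

echo "$result"  # {"items":[{"a":"number","b":"string"}],"tags":["string"]}
```

Note that `add` merges objects shallowly, so when two objects in the same array disagree about the type of a key, only one of the two types survives in the merged profile.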

Answer 2

Score: 1

The jq module schema.jq at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed was designed to produce the kind of structural schema you describe.

For very large inputs, it might be very slow, so if the JSON is sufficiently regular, it might be possible to use a hybrid strategy: profile enough of the data to come up with a comprehensive structural schema, and then check that it applies to the rest.

For conformance testing of structural schemas such as those produced by schema.jq, see https://github.com/pkoppstein/JESS
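
One way to picture the hybrid strategy is sketched below. It is only an illustration under stated assumptions: it uses the short `schema` function from the first answer rather than schema.jq, the sample size of 2 and the `id` field are made up, and the check is deliberately strict (any record whose profile would add or change a top-level key of the reference profile is reported).

```shell
# Made-up records; the third one carries a key the first two do not have.
records='[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c", "extra": true}]'

outliers=$(printf '%s\n' "$records" | jq -c '
def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"  then map(schema) | unique
         | if (first | type) == "object" then [add] else . end
    else type
    end;

# Reference profile built from a sample: the first 2 records only.
(.[:2] | schema | first) as $ref
# Collect the ids of later records whose profile would alter the reference.
| [ .[2:][] | select(schema as $s | ($ref + $s) != $ref) | .id ]
')

echo "$outliers"  # [3]
```

An empty list would suggest that the sampled profile covers the remaining records; a non-empty one points at records worth profiling in full.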

Answer 3

Score: 0

Here's a variant of @Philippe's solution: where that solution calls map(schema) on arrays, this one coalesces the resulting objects in a principled though lossy way. (All these half-solutions trade precision for speed.)

Note that keys_unsorted is used below; if using gojq, then either this would have to be changed to keys, or a definition of keys_unsorted would have to be provided.
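
A minimal sketch of such a fallback definition follows. It assumes that sorted key order is acceptable, which seems reasonable here because the keys are only iterated over. The snippet is run with jq to show that a user definition takes precedence over the builtin; the sample object is made up.

```shell
result=$(jq -cn '
# Fallback definition: delegates to keys, so the result is always sorted.
def keys_unsorted: keys;
{"b": 1, "a": 2} | keys_unsorted')

echo "$result"  # ["a","b"]
```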

```jq
# Use "JSON" as the union of two distinct types
# except combine([]; [ $x ]) => [ $x ]
def combine($a;$b):
  if $a == $b then $a elif $a == null then $b elif $b == null then $a
  elif ($a == []) and ($b|type) == "array" then $b
  elif ($b == []) and ($a|type) == "array" then $a
  else "JSON"
  end;

# Profile an array by calling mergeTypes(.[] | schema)
# in order to coalesce objects
def mergeTypes(s):
  reduce s as $t (null;
    if ($t|type) != "object" then .types = (.types + [$t] | unique)
    else .object as $o
    | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                  .[$k] = combine( $t[$k]; $o[$k] )
                )
    end)
  | (if .object then [.object] else null end ) + .types ;

def schema:
  if   type == "object" then .[] |= schema
  elif type == "array"
  then if . == [] then [] else mergeTypes(.[] | schema) end
  else type
  end;

schema
```

Example:

Input:

```json
{"a": [{"b":[1]}, {"c":[2]}, {"c": []}] }
```

Output:

```json
{
  "a": [
    {
      "b": [
        "number"
      ],
      "c": [
        "number"
      ]
    }
  ]
}
```
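
The example above only exercises the case where the merged types agree. As a further sketch on made-up input (an assumption for illustration), the run below shows the two lossy cases: a key whose type differs between objects is reported as "JSON", and an array mixing scalars with objects yields the merged object followed by the sorted scalar types. The `-S` flag is used only to make the key order of the output predictable.

```shell
# Made-up input: "x" is a number in one object and a string in another;
# "b" mixes scalars and an object in the same array.
doc='{"a": [{"x": 1}, {"x": "s"}, {"y": null}], "b": [1, "s", {"k": true}]}'

result=$(printf '%s\n' "$doc" | jq -cS '
def combine($a;$b):
  if $a == $b then $a elif $a == null then $b elif $b == null then $a
  elif ($a == []) and ($b|type) == "array" then $b
  elif ($b == []) and ($a|type) == "array" then $a
  else "JSON"
  end;

def mergeTypes(s):
  reduce s as $t (null;
    if ($t|type) != "object" then .types = (.types + [$t] | unique)
    else .object as $o
    | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                  .[$k] = combine( $t[$k]; $o[$k] ))
    end)
  | (if .object then [.object] else null end) + .types;

def schema:
  if   type == "object" then .[] |= schema
  elif type == "array"
  then if . == [] then [] else mergeTypes(.[] | schema) end
  else type
  end;

schema')

echo "$result"
# {"a":[{"x":"JSON","y":"null"}],"b":[{"k":"boolean"},"number","string"]}
```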

huangapple
  • Published on February 20, 2023, 00:11:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75501520.html