How to traverse entire JSON structure while appending 3 elements to a dictionary (or any other structure) that meet criteria?

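import pandas as pd

def extract_values_from_child_array(data):
    main_text_list = []
    sub_text_list = []
    id_list = []

    def extract_info(node):
        # Walk one "document_tree" or "child" array.
        for item in node:
            item_type = item.get("type", "")
            # "HTML" also starts with "H", so exclude it explicitly.
            if not item_type.startswith("H") or item_type == "HTML":
                continue
            children = item.get("child", [])
            html_children = [c for c in children if c.get("type") == "HTML"]
            main_text_list.append(item.get("value", ""))
            sub_text_list.append(" ".join(c.get("value", "") for c in html_children))
            if html_children:
                id_list.append(",".join(c.get("id", "") for c in html_children))
            else:
                # No HTML child: record the node's own id, or "" if it has none.
                id_list.append(item.get("id", ""))
            extract_info(children)

    for obj in data:
        if "document_tree" in obj:
            extract_info(obj["document_tree"])

    result_data = pd.DataFrame({
        "main_text": main_text_list,
        "sub_text": sub_text_list,
        "id": id_list,
    })
    return result_data

# Call the function and return the data (data1 is the list of dictionaries)
results1 = extract_values_from_child_array(data1)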

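This code traverses the JSON tree structure, extracts the information that meets the logic you described, and stores it in a pandas DataFrame. You can then use the DataFrame to analyze and process the data further.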


I have a confusing JSON object.

I have a list of dictionaries with the JSON tree structure shown below, along with an example.

I am trying to traverse the tree and pull out 3 pieces of information based on the following logic:

  1. For every 'document_tree' array in every object in the list, if 'type'
    starts with 'H' and 'child' is present, then go into the 'child' array
    and check whether the next 'type' is equal to 'HTML'.

    • If 'type' in 'child' is equal to 'HTML', then concatenate the 'value' strings of all the elements whose 'type' equals 'HTML', tracking the 'id' of every 'value' being concatenated, and append the result to the 'value' string of the parent element.
    • If 'type' in 'child' is not equal to 'HTML', then record only the 'value' string of the parent element, with 'id' equal to an empty string if 'id' is not present in the element. If it is present, then record that value.
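
The rule above can be sketched for a single node as follows. This is only an illustration, assuming each node is a dict shaped like the example data. `summarize_node` is a hypothetical name, and joining values with a space and ids with a comma is an assumption taken from the expected output.

```python
from typing import Dict, Tuple


def summarize_node(node: Dict) -> Tuple[str, str, str]:
    """Return (main_text, sub_text, id) for one heading node.

    Hypothetical helper: it looks only at the node's direct children.
    """
    html_children = [c for c in node.get("child", []) if c.get("type") == "HTML"]
    if html_children:
        # Concatenate the HTML values and keep track of their ids.
        sub_text = " ".join(c.get("value", "") for c in html_children)
        ids = ",".join(c.get("id", "") for c in html_children)
    else:
        # No HTML child: record only the parent's value, with its own id if any.
        sub_text = ""
        ids = node.get("id", "")
    return node.get("value", ""), sub_text, ids


node = {
    "type": "H1",
    "value": "Domestic dog",
    "child": [{"type": "HTML", "id": "090b4d91", "value": "mostly kept as pets"}],
    "id": "1",
}
print(summarize_node(node))
```

Applying a helper like this to every heading node during the traversal gives one row per node.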

The code should return data like this. I am open to any format or type for the result (dict, DataFrame, etc.), but the data will eventually go into pandas, so any format that makes that easy would be appreciated:

   main_text              sub_text                                                             id
0  Dog, facts and photos
1  Domestic dog           mostly kept as pets                                                  090b4d91
2  Dog - Wikipedia
3  Dog                    Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e

My code currently does not traverse the full structure: it does not pull out the parent's 'value' string where 'type' is not equal to 'HTML', and it does not grab the IDs, because I don't know how to implement this or how to structure the data. The code is able to grab some parent and child values.


def extract_values_from_child_array(data):
    results = {}
    for d in data:
        if "document_tree" in d:
            for t in d["document_tree"]:
                if t["type"].startswith("H"):
                    current_type = t["value"]
                    if "child" in t:
                        for c in t["child"]:
                            if c["type"].startswith("H"):
                                current_type = c["value"]
                                if current_type not in results:
                                    results[current_type] = ""
                            elif c["type"] == "HTML":
                                if current_type not in results:
                                    results[current_type] = ""
                                results[current_type] += c["value"]
                            if "child" in c:
                                for gc in c["child"]:
                                    if gc["type"] == "HTML":
                                        if current_type not in results:
                                            results[current_type] = ""
                                        results[current_type] += gc["value"]
    return results

results1 = extract_values_from_child_array(data1)
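
One likely reason the attempt above skips HTML values at the first child level: the string "HTML" itself starts with "H", so the `startswith("H")` branch matches first and the `elif c["type"] == "HTML"` branch is never reached. A small check, using a hypothetical `is_heading` helper that assumes heading types look like "H1", "H2", and so on:

```python
def is_heading(node_type: str) -> bool:
    """Hypothetical helper: True for heading types such as "H1", but not "HTML"."""
    return node_type.startswith("H") and node_type != "HTML"


# "HTML" passes a plain startswith("H") test, which is what hides the elif branch.
print("HTML".startswith("H"))
print(is_heading("H1"))
print(is_heading("HTML"))
```

Testing for "HTML" before testing for the "H" prefix avoids the same problem.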

    ─  (array)
       └─  (object)
          ├─ "id" (string)
          ├─ "key" (string)
          ├─ "number" (number)
          ├─ "result_title" (string)
          ├─ "result_url" (string)
          ├─ "document_tree" (array)
          │  └─  (object)
          │     ├─ "type" (string)
          │     ├─ "value" (string)
          │     ├─ "child" (array)
          │     │  └─  (object)
          │     │     ├─ "type" (string)
          │     │     ├─ "value" (string)
          │     │     ├─ "child" (array)
          │     │     │  └─  (object)
          │     │     │     ├─ "type" (string)
          │     │     │     ├─ "key" (string)
          │     │     │     ├─ "id" (string)
          │     │     │     └─ "value" (string)
          │     │     ├─ "id" (string)
          │     │     └─ "key" (string)
          │     ├─ "id" (string)
          │     └─ "key" (string)
          ├─ "featured_image_url" (string)
          ├─ "hidden" (number)
          ├─ "domain" (string)
          └─ "result_preview_text" (string)
            "id": "1",
            "key": "example_1",
            "number": 1,
            "result_title": "Result Title 1",
            "result_url": "",
            "document_tree": [
                "type": "H1",
                "value": "Dog, facts and photos",
                "child": [
                    "type": "H1",
                    "value": "Domestic dog",
                    "child": [
                        "type": "HTML",
                        "key": "090b4d91",
                        "id": "090b4d91",
                        "value": "mostly kept as pets"
                    "id": "1",
                    "key": "key_1"
                "featured_image_url": "",
                "hidden": 0,
                "domain": "",
                "result_preview_text": "Result Preview Text 1"
                "id": "2",
                "key": "example_2",
                "number": 2,
                "result_title": "Result Title 2",
                "result_url": "",
                "document_tree": [
                    "type": "H1",
                    "value": "Dog - Wikipedia",
                    "child": [
                        "type": "H1",
                        "value": "Dog",
                        "child": [
                            "type": "HTML",
                            "key": "c6edc846",
                            "id": "c6edc846",
                            "value": "Domesticated canid species"
                            "type": "HTML",
                            "key": "e1689ad9",
                            "id": "e1689ad9",
                            "value": "\"Pooch\" For other ."
                            "type": "HTML",
                            "key": "0c95357e",
                            "id": "0c95357e",
                            "value": "\"Doggy\" Gmelin, 1792"
                        "featured_image_url": "",
                        "hidden": 1,
                        "domain": "",
                        "result_preview_text": "Result Preview Text 2"


Score: 1


As of now (2023-02-23 22:05 PST), I think the example provided in the question is not correct. The answer below is based on the following example. Notice that in the following example, "example_1" and "example_2" are on the same level, whereas in the question, "example_2" is contained within "example_1".

[
    {
        "id": "1",
        "key": "example_1",
        "number": 1,
        "result_title": "Result Title 1",
        "result_url": "",
        "document_tree": [
            {
                "type": "H1",
                "value": "Dog, facts and photos",
                "child": [
                    {
                        "type": "H1",
                        "value": "Domestic dog",
                        "child": [
                            {
                                "type": "HTML",
                                "key": "090b4d91",
                                "id": "090b4d91",
                                "value": "mostly kept as pets"
                            }
                        ],
                        "id": "1",
                        "key": "key_1"
                    }
                ]
            }
        ],
        "featured_image_url": "",
        "hidden": 0,
        "domain": "",
        "result_preview_text": "Result Preview Text 1"
    },
    {
        "id": "2",
        "key": "example_2",
        "number": 2,
        "result_title": "Result Title 2",
        "result_url": "",
        "document_tree": [
            {
                "type": "H1",
                "value": "Dog - Wikipedia",
                "child": [
                    {
                        "type": "H1",
                        "value": "Dog",
                        "child": [
                            {
                                "type": "HTML",
                                "key": "c6edc846",
                                "id": "c6edc846",
                                "value": "Domesticated canid species"
                            },
                            {
                                "type": "HTML",
                                "key": "e1689ad9",
                                "id": "e1689ad9",
                                "value": "\"Pooch\" For other ."
                            },
                            {
                                "type": "HTML",
                                "key": "0c95357e",
                                "id": "0c95357e",
                                "value": "\"Doggy\" Gmelin, 1792"
                            }
                        ]
                    }
                ]
            }
        ],
        "featured_image_url": "",
        "hidden": 1,
        "domain": "",
        "result_preview_text": "Result Preview Text 2"
    }
]
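
Because the traversal relies on every top-level object being a sibling in one list, it can help to check that shape before walking the tree. A minimal sketch, assuming the data arrives as a JSON string. `check_top_level` is a hypothetical helper, and the sample string is abbreviated from the example above.

```python
import json
from typing import Any, List


def check_top_level(data: Any) -> List[str]:
    """Return the "key" of each top-level object, or raise if the shape is off."""
    if not isinstance(data, list):
        raise ValueError("expected a list of objects at the top level")
    keys = []
    for obj in data:
        if not isinstance(obj, dict) or not isinstance(obj.get("document_tree"), list):
            raise ValueError("each top-level object needs a 'document_tree' list")
        keys.append(obj.get("key", ""))
    return keys


# Abbreviated sample: two sibling objects, as in the corrected example.
raw = '[{"key": "example_1", "document_tree": []}, {"key": "example_2", "document_tree": []}]'
print(check_top_level(json.loads(raw)))
```

If "example_2" were nested inside "example_1", it would not appear in the returned list, which makes the problem easy to spot.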

Supposing the correction made to the example is valid, a possible solution is shown below (see the docstring for details of the recursion logic).

from typing import Dict, Tuple
import pandas as pd

df_dict = {
    'main_text': [],
    'sub_text': [],
    'id': [],
}

def process_node(node: Dict) -> Tuple[str, str]:
    """Process each parent and child node.

    The logic is that if the node's type is HTML, it must be a leaf node. We
    simply return its value and ID.

    If the node's type starts with "H", it might contain HTML children or not.
    Regardless, we process its children and get their values and IDs.

    If the children are HTML, we record values and IDs in local arrays.
    Otherwise, we ignore them.

    Eventually, we concatenate the values and IDs, and supply them, along with
    the current node's value to df_dict.

    :param node: a node in document tree
    :type node: Dict
    :return: (value, ID) of an HTML node. Otherwise, ('', '') as dummy values.
    :rtype: Tuple[str, str]
    """
    if node.get('type', '') == 'HTML':
        return node['value'], node['id']

    val_list = []
    id_list = []
    if node.get('type', '').startswith('H'):
        for child in node.get('child', []):
            child_val, child_id = process_node(child)
            if child['type'] == 'HTML':
                val_list.append(child_val)
                id_list.append(child_id)
        df_dict['main_text'].append(node.get('value', ''))
        df_dict['sub_text'].append(' '.join(val_list))
        df_dict['id'].append(','.join(id_list))
    return '', ''  # return dummy values

# `data` is the list of top-level objects shown in the example above
for top_object in data:
    for root in top_object['document_tree']:
        process_node(root)

df = pd.DataFrame.from_dict(df_dict)

# display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
print(df)


               main_text                                                             sub_text                          id
0           Domestic dog                                                  mostly kept as pets                    090b4d91
1  Dog, facts and photos                                                                                                 
2                    Dog  Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e
3        Dog - Wikipedia                                                                                                                             
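
The solution above accumulates rows in the module-level `df_dict`, so running the loop twice appends every row twice. One possible variation, sketched here with a hypothetical `iter_rows` generator, yields one row per heading node in the same child-before-parent order and keeps no global state. The sample data is abbreviated, and the resulting list of dicts can be passed to `pd.DataFrame`.

```python
from typing import Dict, Iterator, List


def iter_rows(nodes: List[Dict]) -> Iterator[Dict[str, str]]:
    """Yield one row per heading node, children before their parent."""
    for node in nodes:
        node_type = node.get("type", "")
        if node_type == "HTML" or not node_type.startswith("H"):
            continue  # HTML leaves are folded into their parent's row
        children = node.get("child", [])
        yield from iter_rows(children)
        html = [c for c in children if c.get("type") == "HTML"]
        yield {
            "main_text": node.get("value", ""),
            "sub_text": " ".join(c.get("value", "") for c in html),
            "id": ",".join(c.get("id", "") for c in html),
        }


# Abbreviated sample in the shape of the corrected example.
data = [
    {"document_tree": [{"type": "H1", "value": "Dog, facts and photos", "child": [
        {"type": "H1", "value": "Domestic dog", "id": "1", "child": [
            {"type": "HTML", "id": "090b4d91", "value": "mostly kept as pets"}]}]}]},
]
rows = [row for obj in data for row in iter_rows(obj.get("document_tree", []))]
for row in rows:
    print(row)
```

Because the rows are built fresh on each call, the function can be rerun on new data without clearing anything first.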

  • Posted on 2023-02-24 09:16:59
