How to traverse entire JSON structure while appending 3 elements to a dictionary (or any other structure) that meet criteria?

Question

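For the JSON tree structure you provided, you want to extract information and return the data as a table with the "main_text", "sub_text", and "id" fields. Your current code does not fully meet these requirements and needs to be improved.

The modified code is shown below: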

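import pandas as pd

def extract_values_from_child_array(data):
    main_text_list = []
    sub_text_list = []
    id_list = []

    def extract_info(nodes):
        for item in nodes:
            item_type = item.get("type", "")
            # "HTML" also starts with "H", so exclude it explicitly: HTML
            # nodes are folded into their parent's row instead.
            if item_type == "HTML" or not item_type.startswith("H"):
                continue
            children = item.get("child", [])
            html_children = [c for c in children if c.get("type") == "HTML"]
            main_text_list.append(item.get("value", ""))
            if html_children:
                # Concatenate the HTML children's values and track their ids.
                sub_text_list.append(" ".join(c.get("value", "") for c in html_children))
                id_list.append(",".join(c.get("id", "") for c in html_children))
            else:
                # No HTML children: keep the parent's own id, or "" if absent.
                sub_text_list.append("")
                id_list.append(item.get("id", ""))
            # Recurse so that headings at any depth are visited.
            extract_info(children)

    for obj in data:
        if "document_tree" in obj:
            extract_info(obj["document_tree"])

    result_data = pd.DataFrame({
        "main_text": main_text_list,
        "sub_text": sub_text_list,
        "id": id_list
    })

    return result_data

# Call the function and print the resulting data
results1 = extract_values_from_child_array(data1)
print(results1)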

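This code traverses the JSON tree structure, extracts the information that matches the logic you described, and stores it in a pandas DataFrame. You can then use the DataFrame to analyze and process the data further.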


I have a confusing JSON object.

I have a list of dictionaries with the JSON tree structure shown below, with an example included.

I am trying to traverse the tree and pull out 3 pieces of information based on the following logic:

  1. For every 'document_tree' array in every object in the list, if 'type'
    starts with 'H' and 'child' is present, go into the 'child' array and
    check whether the next 'type' is equal to 'HTML'.

    • If 'type' in 'child' is equal to 'HTML', concatenate the 'value' strings of all elements whose 'type' equals 'HTML', keep track of the 'id' of every 'value' that is concatenated, and append the result to the 'value' string from the parent element. Open to any sort of
    • If 'type' in 'child' is not equal to 'HTML', record only the 'value' string of the parent element, with "id" equal to an empty string if "id" is not present in the element. If it is present, record that value.
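
One detail worth noting about the first rule: the string 'HTML' itself starts with 'H', so a bare startswith('H') check also matches the HTML leaves. A minimal sketch of a stricter check follows, assuming heading types always look like H1, H2, and so on. The helper name is_heading is hypothetical and not part of the original code.

```python
import re

# Assumption: heading types are "H" followed by one or more digits (H1, H2, ...).
_HEADING = re.compile(r"H\d+")

def is_heading(node_type: str) -> bool:
    """Return True for heading types such as 'H1', but not for 'HTML'."""
    return _HEADING.fullmatch(node_type) is not None

print("HTML".startswith("H"))  # True: the naive check also matches HTML
print(is_heading("H1"))        # True
print(is_heading("HTML"))      # False
```

If the real data contains other heading-like types, the pattern would need to be adjusted accordingly.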

The code should return data like the table below. I'm open to any return format or type (dict, DataFrame, etc.), but the data will eventually go into pandas, so a format that makes that easy would be appreciated:

   main_text              sub_text                                                             id
0  Dog, facts and photos
1  Domestic dog           mostly kept as pets                                                  090b4d91
2  Dog - Wikipedia
3  Dog                    Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e

My code currently does not traverse the full structure. It does not pull out the parent's 'value' string where 'type' is not equal to 'HTML', and it does not grab the IDs, because I don't know how to implement this or how to structure the data. The code is able to grab some parent and child values.

code:

def extract_values_from_child_array(data):
    results = {}
    for d in data:
        if "document_tree" in d:
            for t in d["document_tree"]:
                if t["type"].startswith("H"):
                    current_type = t["value"]
                    if "child" in t:
                        for c in t["child"]:
                            if c["type"].startswith("H"):
                                current_type = c["value"]
                                if current_type not in results:
                                    results[current_type] = ""
                            elif c["type"] == "HTML":
                                if current_type not in results:
                                    results[current_type] = ""
                                results[current_type] += c["value"]
                            if "child" in c:
                                for gc in c["child"]:
                                    if gc["type"] == "HTML":
                                        if current_type not in results:
                                            results[current_type] = ""
                                        results[current_type] += gc["value"]
    return results

results1 = extract_values_from_child_array(data1)
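
The nested loops above stop at a fixed depth (children and grandchildren), which is one reason deeper nodes can be missed. A minimal sketch of a recursive walk that visits every node regardless of depth is shown here, assuming each node is a dict with an optional 'child' list. The name iter_nodes and the trimmed sample data are illustrative only.

```python
from typing import Any, Dict, Iterator, List, Tuple

Node = Dict[str, Any]

def iter_nodes(nodes: List[Node], depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) for every node, parents before children."""
    for node in nodes:
        yield depth, node
        # Recurse into 'child' if present; a missing key means a leaf.
        yield from iter_nodes(node.get("child", []), depth + 1)

# Trimmed sample in the shape of one 'document_tree' array.
sample = [
    {"type": "H1", "value": "Dog, facts and photos", "child": [
        {"type": "H1", "value": "Domestic dog", "child": [
            {"type": "HTML", "id": "090b4d91", "value": "mostly kept as pets"},
        ]},
    ]},
]

for depth, node in iter_nodes(sample):
    print("  " * depth + node["type"], "-", node["value"])
```

Because the recursion follows 'child' wherever it appears, the same function handles trees of any depth without adding more loops.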

 tree: 
    ─  (array)
       └─  (object)
          ├─ "id" (string)
          ├─ "key" (string)
          ├─ "number" (number)
          ├─ "result_title" (string)
          ├─ "result_url" (string)
          ├─ "document_tree" (array)
          │  └─  (object)
          │     ├─ "type" (string)
          │     ├─ "value" (string)
          │     ├─ "child" (array)
          │     │  └─  (object)
          │     │     ├─ "type" (string)
          │     │     ├─ "value" (string)
          │     │     ├─ "child" (array)
          │     │     │  └─  (object)
          │     │     │     ├─ "type" (string)
          │     │     │     ├─ "key" (string)
          │     │     │     ├─ "id" (string)
          │     │     │     └─ "value" (string)
          │     │     ├─ "id" (string)
          │     │     └─ "key" (string)
          │     ├─ "id" (string)
          │     └─ "key" (string)
          ├─ "featured_image_url" (string)
          ├─ "hidden" (number)
          ├─ "domain" (string)
          └─ "result_preview_text" (string)
    
    example:
    
     [
          {
            "id": "1",
            "key": "example_1",
            "number": 1,
            "result_title": "Result Title 1",
            "result_url": "https://example.com/result_1",
            "document_tree": [
              {
                "type": "H1",
                "value": "Dog, facts and photos",
                "child": [
                  {
                    "type": "H1",
                    "value": "Domestic dog",
                    "child": [
                      {
                        "type": "HTML",
                        "key": "090b4d91",
                        "id": "090b4d91",
                        "value": "mostly kept as pets"
                      }
                    ],
                    "id": "1",
                    "key": "key_1"
                  }
                ],
                "featured_image_url": "https://example.com/featured_image_1.jpg",
                "hidden": 0,
                "domain": "example.com",
                "result_preview_text": "Result Preview Text 1"
              },
              {
                "id": "2",
                "key": "example_2",
                "number": 2,
                "result_title": "Result Title 2",
                "result_url": "https://example.com/result_2",
                "document_tree": [
                  {
                    "type": "H1",
                    "value": "Dog - Wikipedia",
                    "child": [
                      {
                        "type": "H1",
                        "value": "Dog",
                        "child": [
                          {
                            "type": "HTML",
                            "key": "c6edc846",
                            "id": "c6edc846",
                            "value": "Domesticated canid species"
                          },
                          {
                            "type": "HTML",
                            "key": "e1689ad9",
                            "id": "e1689ad9",
                            "value": "\"Pooch\" For other ."
                          },
                          {
                            "type": "HTML",
                            "key": "0c95357e",
                            "id": "0c95357e",
                            "value": "\"Doggy\" Gmelin, 1792"
                          }
                        ],
                        "featured_image_url": "https://example.com/featured_image_2.jpg",
                        "hidden": 1,
                        "domain": "example.com",
                        "result_preview_text": "Result Preview Text 2"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
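
The tree listing above can also be written down as type hints, which some editors and type checkers can use to catch misspelled keys. This is only a sketch: the class names TreeNode and ResultObject are hypothetical, every key is assumed to be optional, and "number" and "hidden" are assumed to be integers as in the example.

```python
from typing import List, TypedDict

class TreeNode(TypedDict, total=False):
    """One entry of a 'document_tree' or 'child' array (all keys optional)."""
    type: str
    value: str
    id: str
    key: str
    child: List["TreeNode"]

class ResultObject(TypedDict, total=False):
    """One top-level object in the list (all keys optional)."""
    id: str
    key: str
    number: int
    result_title: str
    result_url: str
    document_tree: List[TreeNode]
    featured_image_url: str
    hidden: int
    domain: str
    result_preview_text: str

leaf: TreeNode = {"type": "HTML", "id": "090b4d91", "key": "090b4d91",
                  "value": "mostly kept as pets"}
heading: TreeNode = {"type": "H1", "value": "Domestic dog", "child": [leaf]}
print(heading["child"][0]["value"])  # mostly kept as pets
```

At runtime these are plain dicts, so the hints add no overhead and do not validate the data by themselves.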

Answer 1

Score: 1

As of now (2023-02-23 22:05 PST), I think the example provided in the question is not correct. The answer below is based on the following example. Notice that in the following example, "example_1" and "example_2" are on the same level, whereas in the question, "example_2" is contained within "example_1".

[
  {
    "id": "1",
    "key": "example_1",
    "number": 1,
    "result_title": "Result Title 1",
    "result_url": "https://example.com/result_1",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog, facts and photos",
        "child": [
          {
            "type": "H1",
            "value": "Domestic dog",
            "child": [
              {
                "type": "HTML",
                "key": "090b4d91",
                "id": "090b4d91",
                "value": "mostly kept as pets"
              }
            ],
            "id": "1",
            "key": "key_1"
          }
        ],
        "featured_image_url": "https://example.com/featured_image_1.jpg",
        "hidden": 0,
        "domain": "example.com",
        "result_preview_text": "Result Preview Text 1"
      }
    ]
  },
  {
    "id": "2",
    "key": "example_2",
    "number": 2,
    "result_title": "Result Title 2",
    "result_url": "https://example.com/result_2",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog - Wikipedia",
        "child": [
          {
            "type": "H1",
            "value": "Dog",
            "child": [
              {
                "type": "HTML",
                "key": "c6edc846",
                "id": "c6edc846",
                "value": "Domesticated canid species"
              },
              {
                "type": "HTML",
                "key": "e1689ad9",
                "id": "e1689ad9",
                "value": "\"Pooch\" For other ."
              },
              {
                "type": "HTML",
                "key": "0c95357e",
                "id": "0c95357e",
                "value": "\"Doggy\" Gmelin, 1792"
              }
            ],
            "featured_image_url": "https://example.com/featured_image_2.jpg",
            "hidden": 1,
            "domain": "example.com",
            "result_preview_text": "Result Preview Text 2"
          }
        ]
      }
    ]
  }
]
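
To check whether a payload has the nesting problem described above, one option is to look for entries inside a 'document_tree' that carry their own 'document_tree' key, since those look like top-level result objects rather than tree nodes. This is only a sketch: find_misnested is a hypothetical helper, and the heuristic assumes that real tree nodes never have a 'document_tree' key.

```python
from typing import Any, Dict, List

def find_misnested(data: List[Dict[str, Any]]) -> List[str]:
    """Return the 'key' of every result object found inside a document_tree."""
    found = []
    for top_object in data:
        for entry in top_object.get("document_tree", []):
            # Heuristic (assumption): only top-level results own a document_tree.
            if "document_tree" in entry:
                found.append(entry.get("key", ""))
    return found

# Trimmed-down version of the question's layout: example_2 sits inside
# the document_tree of example_1.
question_layout = [
    {"key": "example_1", "document_tree": [
        {"type": "H1", "value": "Dog, facts and photos"},
        {"key": "example_2", "document_tree": [
            {"type": "H1", "value": "Dog - Wikipedia"},
        ]},
    ]},
]

print(find_misnested(question_layout))  # ['example_2']
```

An empty result suggests the objects are already siblings at the top level, as in the corrected example.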

Assuming the correction made to the example is valid, a possible solution is shown below (see the docstring for details of the recursion logic).

from typing import Dict, Tuple
import pandas as pd

df_dict = {
    'main_text': [],
    'sub_text': [],
    'id': [],
}

def process_node(node: Dict) -> Tuple[str, str]:
    """Process each parent and child node.

    The logic is that if the node's type is HTML, it must be a leaf node. We
    simply return its value and ID.

    If the node's type starts with "H", it might contain HTML children or not.
    Regardless, we process its children and get their values and IDs.

    If the children are HTML, we record values and IDs in local arrays.
    Otherwise, we ignore them.

    Eventually, we concatenate the values and IDs, and supply them, along with
    the current node's value to df_dict.

    :param node: a node in document tree
    :type node: Dict
    :return: (value, ID) of an HTML node. Otherwise, ('', '') as dummy values.
    :rtype: Tuple[str, str]
    """
    if node.get('type', '') == 'HTML':
        return node['value'], node['id']

    val_list = []
    id_list = []
    if node.get('type', '').startswith('H'):        
        for child in node.get('child', []):
            child_val, child_id = process_node(child)
            if child['type'] == 'HTML':
                val_list.append(child_val)
                id_list.append(child_id)
        df_dict['main_text'].append(node.get('value', ''))
        df_dict['sub_text'].append(' '.join(val_list))
        df_dict['id'].append(','.join(id_list))
    return '', ''  # return dummy values
    

for top_object in data:
    for root in top_object['document_tree']:
        process_node(root)

df = pd.DataFrame.from_dict(df_dict)

# display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
print(df)

Output:

               main_text                                                             sub_text                          id
0           Domestic dog                                                  mostly kept as pets                    090b4d91
1  Dog, facts and photos                                                                                                 
2                    Dog  Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e
3        Dog - Wikipedia                                                                                                                             

huangapple
  • This article was posted on 2023-02-24 09:16:59
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75551815.html