2023-02-24 09:16:59

How to traverse entire JSON structure while appending 3 elements to a dictionary (or any other structure) that meet criteria?

Question

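For the JSON tree structure you provided, you want to extract information and return it as a table with the "main_text", "sub_text", and "id" fields. Your current code does not fully meet that requirement and needs some changes.

A revised version of the code is shown below: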

import pandas as pd


def extract_values_from_child_array(data):
    main_text_list = []
    sub_text_list = []
    id_list = []

    def extract_info(nodes):
        for item in nodes:
            item_type = item.get("type", "")
            # "HTML" also starts with "H", so exclude it from the heading check.
            if not item_type.startswith("H") or item_type == "HTML":
                continue
            children = item.get("child", [])
            html_children = [c for c in children if c.get("type") == "HTML"]
            main_text_list.append(item.get("value", ""))
            if html_children:
                # Join the HTML children's values and keep their ids alongside.
                sub_text_list.append(" ".join(c.get("value", "") for c in html_children))
                id_list.append(",".join(c.get("id", "") for c in html_children))
            else:
                # No HTML children: record only the heading, with its own id if present.
                sub_text_list.append("")
                id_list.append(item.get("id", ""))
            # Recurse so that nested headings get their own rows.
            extract_info(children)

    for obj in data:
        if "document_tree" in obj:
            extract_info(obj["document_tree"])

    return pd.DataFrame({
        "main_text": main_text_list,
        "sub_text": sub_text_list,
        "id": id_list,
    })


# Call the function on the question's list of dictionaries (data1) and print the result
results1 = extract_values_from_child_array(data1)
print(results1)

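This code traverses the JSON tree, extracts the information that matches the logic you described, and stores it in a pandas DataFrame, which you can then use for further analysis and processing.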

I have a confusing JSON object.

I have a list of dictionaries with the JSON tree structure below; an example is included.

I am trying to traverse the tree and pull out 3 pieces of information based on the following logic:

For every 'document_tree' array in every object in the list, if 'type'
starts with 'H' and 'child' is present, go into the 'child' array
and check whether the next 'type' is equal to 'HTML'.
- If 'type' in 'child' is equal to 'HTML', concatenate the 'value' strings of all elements whose 'type' equals 'HTML', tracking the 'id' of every 'value' being concatenated, and append the result to the 'value' string from the parent element. Open to any sort of
- If 'type' in 'child' is not equal to 'HTML', record only the 'value' string of the parent element, with "id" set to an empty string if "id" is not present in the element. If it is present, record that value.

The code should return data like this. I am open to any format or type for the return value (dict, DataFrame, etc.). The data will eventually go into pandas, so any format that makes that easy would be appreciated. FYI, I couldn't get the ID column to align:

   main_text              sub_text                                                                 id
0  Dog, facts and photos
1  Domestic dog           mostly kept as pets                                                      090b4d91
2  Dog - Wikipedia
3  Dog                    Domesticated canid species "\"Pooch\" For other \"Doggy\" Gmelin, 1792.  c6edc846,e1689ad9,0c95357e

My code currently does not traverse the full structure: it does not pull out the parent's 'value' string when the child's 'type' is not equal to 'HTML', and it does not grab the IDs, because I don't know how to implement this or how to structure the data. The code is able to grab some parent and child values.

code:

def extract_values_from_child_array(data):
    results = {}
    for d in data:
        if "document_tree" in d:
            for t in d["document_tree"]:
                if t["type"].startswith("H"):
                    current_type = t["value"]
                    if "child" in t:
                        for c in t["child"]:
                            if c["type"].startswith("H"):
                                current_type = c["value"]
                                if current_type not in results:
                                    results[current_type] = ""
                            elif c["type"] == "HTML":
                                if current_type not in results:
                                    results[current_type] = ""
                                results[current_type] += c["value"]
                            if "child" in c:
                                for gc in c["child"]:
                                    if gc["type"] == "HTML":
                                        if current_type not in results:
                                            results[current_type] = ""
                                        results[current_type] += gc["value"]
    return results
results1 = extract_values_from_child_array(data1)


 tree: 
    ─  (array)
       └─  (object)
          ├─ "id" (string)
          ├─ "key" (string)
          ├─ "number" (number)
          ├─ "result_title" (string)
          ├─ "result_url" (string)
          ├─ "document_tree" (array)
          │  └─  (object)
          │     ├─ "type" (string)
          │     ├─ "value" (string)
          │     ├─ "child" (array)
          │     │  └─  (object)
          │     │     ├─ "type" (string)
          │     │     ├─ "value" (string)
          │     │     ├─ "child" (array)
          │     │     │  └─  (object)
          │     │     │     ├─ "type" (string)
          │     │     │     ├─ "key" (string)
          │     │     │     ├─ "id" (string)
          │     │     │     └─ "value" (string)
          │     │     ├─ "id" (string)
          │     │     └─ "key" (string)
          │     ├─ "id" (string)
          │     └─ "key" (string)
          ├─ "featured_image_url" (string)
          ├─ "hidden" (number)
          ├─ "domain" (string)
          └─ "result_preview_text" (string)
    
    example:
    
     [
          {
            "id": "1",
            "key": "example_1",
            "number": 1,
            "result_title": "Result Title 1",
            "result_url": "https://example.com/result_1",
            "document_tree": [
              {
                "type": "H1",
                "value": "Dog, facts and photos",
                "child": [
                  {
                    "type": "H1",
                    "value": "Domestic dog",
                    "child": [
                      {
                        "type": "HTML",
                        "key": "090b4d91",
                        "id": "090b4d91",
                        "value": "mostly kept as pets"
                      }
                    ],
                    "id": "1",
                    "key": "key_1"
                  }
                ],
                "featured_image_url": "https://example.com/featured_image_1.jpg",
                "hidden": 0,
                "domain": "example.com",
                "result_preview_text": "Result Preview Text 1"
              },
              {
                "id": "2",
                "key": "example_2",
                "number": 2,
                "result_title": "Result Title 2",
                "result_url": "https://example.com/result_2",
                "document_tree": [
                  {
                    "type": "H1",
                    "value": "Dog - Wikipedia",
                    "child": [
                      {
                        "type": "H1",
                        "value": "Dog",
                        "child": [
                          {
                            "type": "HTML",
                            "key": "c6edc846",
                            "id": "c6edc846",
                            "value": "Domesticated canid species"
                          },
                          {
                            "type": "HTML",
                            "key": "e1689ad9",
                            "id": "e1689ad9",
                            "value": "\"Pooch\" For other ."
                          },
                          {
                            "type": "HTML",
                            "key": "0c95357e",
                            "id": "0c95357e",
                            "value": "\"Doggy\" Gmelin, 1792"
                          }
                        ],
                        "featured_image_url": "https://example.com/featured_image_2.jpg",
                        "hidden": 1,
                        "domain": "example.com",
                        "result_preview_text": "Result Preview Text 2"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
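
The nested "child" arrays in the tree above can be visited with a small recursive generator before any extraction rules are applied. The following is only a minimal sketch: the function name `walk` and the `sample_tree` data are made up for illustration, and it assumes every node is a dictionary whose optional "child" key holds a list of further nodes.

```python
from typing import Any, Dict, Iterator, List, Tuple


def walk(nodes: List[Dict[str, Any]], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) for every node, parents before their children."""
    for node in nodes:
        yield depth, node
        # "child" is optional, so fall back to an empty list.
        yield from walk(node.get("child", []), depth + 1)


# Hypothetical sample shaped like one "document_tree" array.
sample_tree = [
    {
        "type": "H1",
        "value": "Dog",
        "child": [
            {"type": "HTML", "id": "c6edc846", "value": "Domesticated canid species"},
        ],
    }
]

outline = [(depth, node["type"], node["value"]) for depth, node in walk(sample_tree)]
for depth, node_type, value in outline:
    # Indent each line by its depth to show the nesting.
    print("  " * depth + f"{node_type}: {value}")
```

Printing an outline like this is a quick way to see how deep the headings nest before deciding which nodes should become rows.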

Answer 1

Score: 1

As of now (2023-02-23 22:05 PST), I think the example provided in the question is not correct. The answer below is based on the following example. Notice that in the following example, "example_1" and "example_2" are on the same level, whereas in the question, "example_2" is contained within "example_1".

[
  {
    "id": "1",
    "key": "example_1",
    "number": 1,
    "result_title": "Result Title 1",
    "result_url": "https://example.com/result_1",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog, facts and photos",
        "child": [
          {
            "type": "H1",
            "value": "Domestic dog",
            "child": [
              {
                "type": "HTML",
                "key": "090b4d91",
                "id": "090b4d91",
                "value": "mostly kept as pets"
              }
            ],
            "id": "1",
            "key": "key_1"
          }
        ],
        "featured_image_url": "https://example.com/featured_image_1.jpg",
        "hidden": 0,
        "domain": "example.com",
        "result_preview_text": "Result Preview Text 1"
      }
    ]
  },
  {
    "id": "2",
    "key": "example_2",
    "number": 2,
    "result_title": "Result Title 2",
    "result_url": "https://example.com/result_2",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog - Wikipedia",
        "child": [
          {
            "type": "H1",
            "value": "Dog",
            "child": [
              {
                "type": "HTML",
                "key": "c6edc846",
                "id": "c6edc846",
                "value": "Domesticated canid species"
              },
              {
                "type": "HTML",
                "key": "e1689ad9",
                "id": "e1689ad9",
                "value": "\"Pooch\" For other ."
              },
              {
                "type": "HTML",
                "key": "0c95357e",
                "id": "0c95357e",
                "value": "\"Doggy\" Gmelin, 1792"
              }
            ],
            "featured_image_url": "https://example.com/featured_image_2.jpg",
            "hidden": 1,
            "domain": "example.com",
            "result_preview_text": "Result Preview Text 2"
          }
        ]
      }
    ]
  }
]
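
A nesting mistake like the one in the question's example (a result object sitting inside another result's "document_tree") can be detected before extraction. The sketch below is an assumption-laden illustration, not part of the original answer: the helper name `find_misplaced_results` and the `question_like` sample are hypothetical, and it assumes that genuine tree nodes always carry a "type" key while result objects carry their own "document_tree".

```python
from typing import Any, Dict, List


def find_misplaced_results(nodes: List[Dict[str, Any]], path: str = "document_tree") -> List[str]:
    """Return the paths of tree entries that look like top-level result objects."""
    problems: List[str] = []
    for index, node in enumerate(nodes):
        here = f"{path}[{index}]"
        # A tree node should have "type"; a result object has its own "document_tree".
        if "document_tree" in node or "type" not in node:
            problems.append(here)
        # Keep checking deeper levels through the optional "child" list.
        problems.extend(find_misplaced_results(node.get("child", []), here + ".child"))
    return problems


# Hypothetical, trimmed-down version of the question's document_tree array.
question_like = [
    {"type": "H1", "value": "Dog, facts and photos", "child": []},
    {"id": "2", "key": "example_2", "document_tree": []},
]

print(find_misplaced_results(question_like))
```

An empty list means every entry looks like a regular tree node under these assumptions.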

Assuming the correction made to the example is valid, a possible solution is shown below (see the docstring for details of the recursion logic).

from typing import Dict, Tuple

import pandas as pd

df_dict = {
    'main_text': [],
    'sub_text': [],
    'id': [],
}


def process_node(node: Dict) -> Tuple[str, str]:
    """Process each parent and child node.

    The logic is that if the node's type is HTML, it must be a leaf node. We
    simply return its value and ID.
    If the node's type starts with "H", it might contain HTML children or not.
    Regardless, we process its children and get their values and IDs.
    If the children are HTML, we record values and IDs in local arrays.
    Otherwise, we ignore them.
    Eventually, we concatenate the values and IDs, and supply them, along with
    the current node's value, to df_dict.

    :param node: a node in document tree
    :type node: Dict
    :return: (value, ID) of an HTML node. Otherwise, ('', '') as dummy values.
    :rtype: Tuple[str, str]
    """
    if node.get('type', '') == 'HTML':
        return node['value'], node['id']
    val_list = []
    id_list = []
    if node.get('type', '').startswith('H'):
        for child in node.get('child', []):
            child_val, child_id = process_node(child)
            if child['type'] == 'HTML':
                val_list.append(child_val)
                id_list.append(child_id)
        df_dict['main_text'].append(node.get('value', ''))
        df_dict['sub_text'].append(' '.join(val_list))
        df_dict['id'].append(','.join(id_list))
    return '', ''  # return dummy values


# `data` is the corrected list of dictionaries shown above
for top_object in data:
    for root in top_object['document_tree']:
        process_node(root)
df = pd.DataFrame.from_dict(df_dict)
# display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
print(df)

Output:

               main_text                                                             sub_text                          id
0           Domestic dog                                                  mostly kept as pets                    090b4d91
1  Dog, facts and photos                                                                                                 
2                    Dog  Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e
3        Dog - Wikipedia
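
In the output above, each nested heading appears before its parent, because a row is appended only after the recursive calls on the children return. If the parent-first order from the question matters, one option is to emit the row before recursing and to return rows instead of filling a module-level dictionary. The following is a rough sketch under those assumptions; `iter_rows` and the trimmed `data` sample are hypothetical names, not part of the original answer.

```python
from typing import Any, Dict, Iterator, List


def iter_rows(nodes: List[Dict[str, Any]]) -> Iterator[Dict[str, str]]:
    """Yield one row per heading node, each parent before its nested headings."""
    for node in nodes:
        node_type = node.get("type", "")
        # "HTML" starts with "H" too, so it is excluded explicitly.
        if not node_type.startswith("H") or node_type == "HTML":
            continue
        children = node.get("child", [])
        html = [c for c in children if c.get("type") == "HTML"]
        # Emit the parent's row first, then descend into the children.
        yield {
            "main_text": node.get("value", ""),
            "sub_text": " ".join(c.get("value", "") for c in html),
            "id": ",".join(c.get("id", "") for c in html),
        }
        yield from iter_rows(children)


# Hypothetical input: a trimmed copy of the first object in the corrected example.
data = [
    {"document_tree": [
        {"type": "H1", "value": "Dog, facts and photos", "child": [
            {"type": "H1", "value": "Domestic dog", "child": [
                {"type": "HTML", "id": "090b4d91", "value": "mostly kept as pets"},
            ]},
        ]},
    ]},
]

rows = [row for obj in data for row in iter_rows(obj.get("document_tree", []))]
for row in rows:
    print(row)
```

Because `rows` is a plain list of dictionaries, it can be handed to `pd.DataFrame(rows)` when a DataFrame is needed, and calling the function twice does not accumulate duplicate rows.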
