English:
How to traverse an entire JSON structure while appending 3 elements that meet criteria to a dictionary (or any other structure)?
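Question
For the JSON tree structure you provided, you want to extract information and return the data as a table with the "main_text", "sub_text", and "id" fields. Your current code does not fully meet these requirements and needs to be improved.
Here is the revised code: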
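import pandas as pd


def extract_values_from_child_array(data):
    main_text_list = []
    sub_text_list = []
    id_list = []

    def extract_info(nodes):
        for item in nodes:
            item_type = item.get("type", "")
            # "HTML" also starts with "H", so HTML nodes are excluded explicitly.
            if item_type == "HTML" or not item_type.startswith("H"):
                continue
            children = item.get("child", [])
            html_children = [c for c in children if c.get("type") == "HTML"]
            main_text_list.append(item.get("value", ""))
            # Concatenate the values of all HTML children into one string.
            sub_text_list.append(" ".join(c.get("value", "") for c in html_children))
            if html_children:
                # Track the id of every HTML value that was concatenated.
                id_list.append(",".join(c.get("id", "") for c in html_children))
            else:
                # No HTML children: use the node's own id, or "" if it has none.
                id_list.append(item.get("id", ""))
            # Recurse so that nested headings get rows of their own.
            extract_info(children)

    for obj in data:
        if "document_tree" in obj:
            extract_info(obj["document_tree"])

    result_data = pd.DataFrame({
        "main_text": main_text_list,
        "sub_text": sub_text_list,
        "id": id_list
    })
    return result_data


# Call the function and print the result
results1 = extract_values_from_child_array(data1)
print(results1)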
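This code traverses the JSON tree structure, extracts the information that matches the logic you described, and stores it in a pandas DataFrame. You can then use the DataFrame for further analysis and processing.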
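One detail is easy to miss when heading nodes are selected by prefix: the string "HTML" itself starts with "H", so a bare startswith("H") test also matches HTML nodes. Below is a minimal sketch of a stricter check. The helper name `is_heading` is hypothetical, and it assumes heading types are "H" followed only by digits (H1, H2, ...), which the example data suggests but does not guarantee.

```python
def is_heading(node_type: str) -> bool:
    """Return True for types such as "H1" or "H2", but not for "HTML".

    Assumption: heading types are "H" followed only by digits.
    """
    return node_type.startswith("H") and node_type[1:].isdigit()


print("HTML".startswith("H"))  # True: the plain prefix test also matches HTML
print(is_heading("H1"))        # True
print(is_heading("HTML"))      # False
```

If the real data uses other heading labels, comparing against "HTML" explicitly before the prefix test may be the safer choice.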
English:
I've got a confusing JSON object.
I have a list of dictionaries with the JSON tree structure below; an example is included.
I am trying to traverse the tree and pull out 3 pieces of information based on the following logic:
- For every 'document_tree' array in every object in the list, if 'type' starts with 'H' and 'child' is present, go into the 'child' array and check whether the next 'type' is equal to 'HTML'.
- If 'type' in 'child' is equal to 'HTML', concatenate the 'value' strings of all elements whose 'type' equals 'HTML', tracking the 'id' of every 'value' being concatenated, and append the result to the 'value' string of the parent element. Open to any sort of
- If 'type' in 'child' is not equal to 'HTML', record only the 'value' string of the parent element, with "id" set to an empty string if "id" is not present in the element. If it is present, record that value.
The code should return data like this, and I'm open to any format or type for the result (dict, DataFrame, etc.). The data will eventually go into pandas, so any format that makes that easy would be appreciated:
   main_text              sub_text                                                                 id
0  Dog, facts and photos
1  Domestic dog           mostly kept as pets                                                      090b4d91
2  Dog - Wikipedia
3  Dog                    Domesticated canid species "\"Pooch\" For other \"Doggy\" Gmelin, 1792.  c6edc846,e1689ad9,0c95357e
My code currently does not traverse the full structure. It does not pull out the parent's 'value' string where 'type' is not equal to 'HTML', and it does not grab the IDs, because I don't know how to implement this or how to structure the data. The code is able to grab some parent and child values.
code:
def extract_values_from_child_array(data):
    results = {}
    for d in data:
        if "document_tree" in d:
            for t in d["document_tree"]:
                if t["type"].startswith("H"):
                    current_type = t["value"]
                    if "child" in t:
                        for c in t["child"]:
                            if c["type"].startswith("H"):
                                current_type = c["value"]
                                if current_type not in results:
                                    results[current_type] = ""
                            elif c["type"] == "HTML":
                                if current_type not in results:
                                    results[current_type] = ""
                                results[current_type] += c["value"]
                            if "child" in c:
                                for gc in c["child"]:
                                    if gc["type"] == "HTML":
                                        if current_type not in results:
                                            results[current_type] = ""
                                        results[current_type] += gc["value"]
    return results

results1 = extract_values_from_child_array(data1)
tree:
─ (array)
└─ (object)
├─ "id" (string)
├─ "key" (string)
├─ "number" (number)
├─ "result_title" (string)
├─ "result_url" (string)
├─ "document_tree" (array)
│ └─ (object)
│ ├─ "type" (string)
│ ├─ "value" (string)
│ ├─ "child" (array)
│ │ └─ (object)
│ │ ├─ "type" (string)
│ │ ├─ "value" (string)
│ │ ├─ "child" (array)
│ │ │ └─ (object)
│ │ │ ├─ "type" (string)
│ │ │ ├─ "key" (string)
│ │ │ ├─ "id" (string)
│ │ │ └─ "value" (string)
│ │ ├─ "id" (string)
│ │ └─ "key" (string)
│ ├─ "id" (string)
│ └─ "key" (string)
├─ "featured_image_url" (string)
├─ "hidden" (number)
├─ "domain" (string)
└─ "result_preview_text" (string)
example:
[
  {
    "id": "1",
    "key": "example_1",
    "number": 1,
    "result_title": "Result Title 1",
    "result_url": "https://example.com/result_1",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog, facts and photos",
        "child": [
          {
            "type": "H1",
            "value": "Domestic dog",
            "child": [
              {
                "type": "HTML",
                "key": "090b4d91",
                "id": "090b4d91",
                "value": "mostly kept as pets"
              }
            ],
            "id": "1",
            "key": "key_1"
          }
        ],
        "featured_image_url": "https://example.com/featured_image_1.jpg",
        "hidden": 0,
        "domain": "example.com",
        "result_preview_text": "Result Preview Text 1"
      },
      {
        "id": "2",
        "key": "example_2",
        "number": 2,
        "result_title": "Result Title 2",
        "result_url": "https://example.com/result_2",
        "document_tree": [
          {
            "type": "H1",
            "value": "Dog - Wikipedia",
            "child": [
              {
                "type": "H1",
                "value": "Dog",
                "child": [
                  {
                    "type": "HTML",
                    "key": "c6edc846",
                    "id": "c6edc846",
                    "value": "Domesticated canid species"
                  },
                  {
                    "type": "HTML",
                    "key": "e1689ad9",
                    "id": "e1689ad9",
                    "value": "\"Pooch\" For other ."
                  },
                  {
                    "type": "HTML",
                    "key": "0c95357e",
                    "id": "0c95357e",
                    "value": "\"Doggy\" Gmelin, 1792"
                  }
                ],
                "featured_image_url": "https://example.com/featured_image_2.jpg",
                "hidden": 1,
                "domain": "example.com",
                "result_preview_text": "Result Preview Text 2"
              }
            ]
          }
        ]
      }
    ]
  }
]
Answer 1
Score: 1
As of now (2023-02-23 22:05 PST), I think the example provided in the question is not correct. The answer below is based on the following example. Notice that in the following example, "example_1" and "example_2" are on the same level, whereas in the question, "example_2" is contained within "example_1".
[
  {
    "id": "1",
    "key": "example_1",
    "number": 1,
    "result_title": "Result Title 1",
    "result_url": "https://example.com/result_1",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog, facts and photos",
        "child": [
          {
            "type": "H1",
            "value": "Domestic dog",
            "child": [
              {
                "type": "HTML",
                "key": "090b4d91",
                "id": "090b4d91",
                "value": "mostly kept as pets"
              }
            ],
            "id": "1",
            "key": "key_1"
          }
        ],
        "featured_image_url": "https://example.com/featured_image_1.jpg",
        "hidden": 0,
        "domain": "example.com",
        "result_preview_text": "Result Preview Text 1"
      }
    ]
  },
  {
    "id": "2",
    "key": "example_2",
    "number": 2,
    "result_title": "Result Title 2",
    "result_url": "https://example.com/result_2",
    "document_tree": [
      {
        "type": "H1",
        "value": "Dog - Wikipedia",
        "child": [
          {
            "type": "H1",
            "value": "Dog",
            "child": [
              {
                "type": "HTML",
                "key": "c6edc846",
                "id": "c6edc846",
                "value": "Domesticated canid species"
              },
              {
                "type": "HTML",
                "key": "e1689ad9",
                "id": "e1689ad9",
                "value": "\"Pooch\" For other ."
              },
              {
                "type": "HTML",
                "key": "0c95357e",
                "id": "0c95357e",
                "value": "\"Doggy\" Gmelin, 1792"
              }
            ],
            "featured_image_url": "https://example.com/featured_image_2.jpg",
            "hidden": 1,
            "domain": "example.com",
            "result_preview_text": "Result Preview Text 2"
          }
        ]
      }
    ]
  }
]
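One way to check where an object actually sits in the structure is to walk it and print the path to every dict whose "key" matches. The following is only a sketch: `find_key_paths` is a hypothetical helper, and `nested` and `flat` are trimmed-down stand-ins for the two layouts, not the full examples.

```python
from typing import Any, List


def find_key_paths(obj: Any, target: str, path: str = "data") -> List[str]:
    """Return the path of every dict whose "key" field equals `target`."""
    found = []
    if isinstance(obj, dict):
        if obj.get("key") == target:
            found.append(path)
        for name, value in obj.items():
            found.extend(find_key_paths(value, target, f"{path}[{name!r}]"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            found.extend(find_key_paths(value, target, f"{path}[{index}]"))
    return found


# Illustrative data only: "example_2" nested inside "example_1" vs. beside it.
nested = [{"key": "example_1", "document_tree": [{"type": "H1"}, {"key": "example_2"}]}]
flat = [{"key": "example_1", "document_tree": []}, {"key": "example_2", "document_tree": []}]

print(find_key_paths(nested, "example_2"))  # ["data[0]['document_tree'][1]"]
print(find_key_paths(flat, "example_2"))    # ['data[1]']
```

A path of the form data[1] means the object is a top-level element, which is the layout the rest of this answer assumes.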
Assuming the correction made to the example is valid, a possible solution is shown below (see the docstring for details of the recursion logic).
from typing import Dict, Tuple

import pandas as pd

df_dict = {
    'main_text': [],
    'sub_text': [],
    'id': [],
}


def process_node(node: Dict) -> Tuple[str, str]:
    """Process each parent and child node.

    The logic is that if the node's type is HTML, it must be a leaf node. We
    simply return its value and ID.

    If the node's type starts with "H", it might contain HTML children or not.
    Regardless, we process its children and get their values and IDs.
    If the children are HTML, we record values and IDs in local arrays.
    Otherwise, we ignore them.
    Eventually, we concatenate the values and IDs, and supply them, along with
    the current node's value, to df_dict.

    :param node: a node in document tree
    :type node: Dict
    :return: (value, ID) of an HTML node. Otherwise, ('', '') as dummy values.
    :rtype: Tuple[str, str]
    """
    if node.get('type', '') == 'HTML':
        return node['value'], node['id']

    val_list = []
    id_list = []
    if node.get('type', '').startswith('H'):
        for child in node.get('child', []):
            child_val, child_id = process_node(child)
            if child['type'] == 'HTML':
                val_list.append(child_val)
                id_list.append(child_id)
        df_dict['main_text'].append(node.get('value', ''))
        df_dict['sub_text'].append(' '.join(val_list))
        df_dict['id'].append(','.join(id_list))
    return '', ''  # return dummy values


# `data` is the corrected example above, parsed into a list of dicts
for top_object in data:
    for root in top_object['document_tree']:
        process_node(root)

df = pd.DataFrame.from_dict(df_dict)

# display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
print(df)
Output:
   main_text              sub_text                                                             id
0  Domestic dog           mostly kept as pets                                                  090b4d91
1  Dog, facts and photos
2  Dog                    Domesticated canid species "Pooch" For other . "Doggy" Gmelin, 1792  c6edc846,e1689ad9,0c95357e
3  Dog - Wikipedia
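Note that the rows above come out child-first, because each node is appended only after its children have been processed, whereas the question lists each parent before its child. If that order matters, a pre-order variant might look like the sketch below. The names `collect_rows` and `tree_to_rows` are hypothetical, the sample input is a shortened stand-in for the corrected example, and missing "id" or "value" fields are assumed to mean an empty string. Like the code above, it leaves "id" empty for nodes without HTML children.

```python
from typing import Dict, List


def collect_rows(node: Dict, rows: List[Dict]) -> None:
    """Append one row per heading node, visiting parents before children."""
    node_type = node.get("type", "")
    if node_type == "HTML" or not node_type.startswith("H"):
        return  # HTML leaves are folded into their parent's row
    children = node.get("child", [])
    html_children = [c for c in children if c.get("type") == "HTML"]
    rows.append({
        "main_text": node.get("value", ""),
        "sub_text": " ".join(c.get("value", "") for c in html_children),
        # A missing "id" becomes an empty string instead of raising KeyError.
        "id": ",".join(c.get("id", "") for c in html_children),
    })
    for child in children:
        collect_rows(child, rows)


def tree_to_rows(data: List[Dict]) -> List[Dict]:
    """Collect rows for every root node of every top-level object."""
    rows: List[Dict] = []
    for top_object in data:
        for root in top_object.get("document_tree", []):
            collect_rows(root, rows)
    return rows


# Shortened illustrative input shaped like the corrected example.
sample = [{
    "document_tree": [{
        "type": "H1",
        "value": "Dog, facts and photos",
        "child": [{
            "type": "H1",
            "value": "Domestic dog",
            "child": [{"type": "HTML", "id": "090b4d91", "value": "mostly kept as pets"}],
        }],
    }],
}]

rows = tree_to_rows(sample)
for row in rows:
    print(row)
```

Because the result is a plain list of dicts, pd.DataFrame(rows) turns it into a frame with the same three columns, and no module-level dictionary needs to be reset between runs.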