Create a hierarchy between categories
Question
I have the following DataFrame:
import pandas as pd

pandas_test = pd.DataFrame(data={'TAGS': [['Category1', 'Category2', 'Category3'],
                                          ['Category2', 'Category4'],
                                          ['Category5', 'Category4'],
                                          ['Category5', 'Category4', 'Category6', 'Category8'],
                                          ['Category1', 'Category2'],
                                          ['Category2', 'Category3']]})
I am trying to find the parent of each category. The rule is: if categoryA appears directly to the left of categoryB in the same row, categoryA is the parent of categoryB. For pandas_test, I would like this result:
{'Category1': None, 'Category2': 'Category1', 'Category3': 'Category2', 'Category4': 'Category2', 'Category5': None, 'Category4': 'Category5', 'Category6': 'Category4', 'Category8': 'Category6'}.
Here, Category1 has no parent, Category2 has Category1 as its parent, Category3 has Category2, and so on.
For the moment, I have the following code:
category_counts = {}
for index, row in pandas_test.iterrows():
    #categories = row['TAGS'][0].split(',') if row['TAGS'] else []
    categories = row['TAGS']
    for i in range(len(categories)):
        category = categories[i].strip()
        if category not in category_counts:
            category_counts[category] = {'count': 1, 'subcategories': set()}
        else:
            category_counts[category]['count'] += 1
        for j in range(i + 1, len(categories)):
            subcategory = categories[j].strip()
            category_counts[category]['subcategories'].add(subcategory)

# Analyze category_counts dictionary to determine hierarchy
hierarchy = {}
for category, data in category_counts.items():
    subcategories = data['subcategories']
    for subcategory in subcategories:
        if subcategory in category_counts:
            if category not in category_counts[subcategory]['subcategories']:
                hierarchy[subcategory] = category

# Apply hierarchy to categories
for category, parent in hierarchy.items():
    if parent in hierarchy:
        hierarchy[category] = hierarchy[parent]

print(hierarchy)
But this code returns the following result:
{'Category3': 'Category1', 'Category2': 'Category1', 'Category4': 'Category5', 'Category6': 'Category5', 'Category8': 'Category5'}
Category3 should have Category2 as its parent. Category1 is of course also an ancestor of Category3, because Category1 is the parent of Category2 and Category2 is the parent of Category3, but I want the closest parent (the same applies to Category6 and Category8, which both get Category5 as their parent). I also want Category4 to be the child of both Category2 and Category5.
Can someone help me, please?
Answer 1
Score: 2
This is a graph problem. Use a specialized library like networkx to build a directed graph and get the predecessors of each node:
# pip install networkx
import networkx as nx
from itertools import pairwise

G = nx.from_edgelist([edge for l in pandas_test['TAGS']
                      for edge in pairwise(l)],
                     create_using=nx.DiGraph)

out = {n: list(G.predecessors(n)) for n in G}
print(out)
Note: on Python < 3.10, you can replace pairwise(l) with zip(l, l[1:]).
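For readers who prefer not to install networkx, the same predecessor lookup can be sketched with the standard library only, using the zip(l, l[1:]) pairing mentioned in the note. This is an illustrative sketch, not part of the original answer: `rows` is a plain-list stand-in for pandas_test['TAGS'], and `direct_parents` is a hypothetical helper name.

```python
from collections import defaultdict

# Plain-list stand-in for pandas_test['TAGS'] (same data as in the question).
rows = [
    ['Category1', 'Category2', 'Category3'],
    ['Category2', 'Category4'],
    ['Category5', 'Category4'],
    ['Category5', 'Category4', 'Category6', 'Category8'],
    ['Category1', 'Category2'],
    ['Category2', 'Category3'],
]


def direct_parents(rows):
    """Map each category to the sorted list of its immediate left neighbours."""
    parents = defaultdict(set)
    for tags in rows:
        for tag in tags:
            parents[tag]  # touch the key so categories without a parent still appear
        for parent, child in zip(tags, tags[1:]):
            parents[child].add(parent)
    return {tag: sorted(found) for tag, found in parents.items()}


out = direct_parents(rows)
print(out)
```

Sets are used while collecting so that a pair seen in several rows is stored once; sorting at the end only makes the printed lists stable.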
Output:
{'Category1': [],
'Category2': ['Category1'],
'Category3': ['Category2'],
'Category4': ['Category2', 'Category5'],
'Category5': [],
'Category6': ['Category4'],
'Category8': ['Category6']}
As a DataFrame:
df_out = (pd.Series(out).explode()
            .rename_axis('Child').reset_index(name='Parent')
          )
Output:
       Child     Parent
0  Category1        NaN
1  Category2  Category1
2  Category3  Category2
3  Category4  Category2
4  Category4  Category5
5  Category5        NaN
6  Category6  Category4
7  Category8  Category6
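The question distinguishes the closest parent from more distant ancestors (Category1 is an ancestor of Category3, but not its parent). As a rough sketch of that distinction, the direct-parent mapping can be walked upwards to collect every ancestor. The `parents` dictionary below is copied from the output shown earlier; the `ancestors` function is a hypothetical helper written for this illustration. networkx offers nx.ancestors(G, node) for the same purpose when the graph object is available.

```python
# Direct-parent mapping, copied from the predecessor output of the answer.
parents = {
    'Category1': [],
    'Category2': ['Category1'],
    'Category3': ['Category2'],
    'Category4': ['Category2', 'Category5'],
    'Category5': [],
    'Category6': ['Category4'],
    'Category8': ['Category6'],
}


def ancestors(category, parents):
    """Return every category reachable by repeatedly following parent links."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # the check also guards against cycles
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors('Category3', parents)))  # ['Category1', 'Category2']
print(sorted(ancestors('Category8', parents)))
```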
The graph:
Answer 2
Score: 0
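Please try the following approach to get your output.
result_dict = {}
# First pass: store the immediate left neighbour as the parent
# (None for the first tag of a row). A later row overwrites an earlier entry.
for tags in pandas_test['TAGS']:
    for i, category in enumerate(tags):
        if i > 0:
            result_dict[category] = tags[i - 1]
        else:
            result_dict[category] = None

# Second pass: make sure every parent also has an entry of its own.
for tags in pandas_test['TAGS']:
    for i, category in enumerate(tags):
        if i > 0 and tags[i - 1] not in result_dict:
            result_dict[tags[i - 1]] = None

print(result_dict)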
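A dictionary with one value per key can hold only one parent per category, while the question asks for Category4 to be the child of both Category2 and Category5. As a hedged illustration, the sketch below lists the categories that have more than one distinct left neighbour and therefore cannot be represented that way. `rows` is a plain-list stand-in for pandas_test['TAGS'], and `multi_parent_categories` is a hypothetical helper name.

```python
from collections import defaultdict

# Plain-list stand-in for pandas_test['TAGS'] (same data as in the question).
rows = [
    ['Category1', 'Category2', 'Category3'],
    ['Category2', 'Category4'],
    ['Category5', 'Category4'],
    ['Category5', 'Category4', 'Category6', 'Category8'],
    ['Category1', 'Category2'],
    ['Category2', 'Category3'],
]


def multi_parent_categories(rows):
    """Return the categories that have more than one distinct immediate parent."""
    parents = defaultdict(set)
    for tags in rows:
        for parent, child in zip(tags, tags[1:]):
            parents[child].add(parent)
    return {child: sorted(found)
            for child, found in parents.items() if len(found) > 1}


print(multi_parent_categories(rows))  # {'Category4': ['Category2', 'Category5']}
```

An empty result would mean that a flat child-to-parent dictionary is enough for the data at hand; a non-empty one suggests storing a list or set of parents per category instead.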