Create a hierarchy between categories

Question

I have the following DataFrame:

    import pandas as pd

    pandas_test = pd.DataFrame(data={'TAGS': [['Category1', 'Category2', 'Category3'],
                                              ['Category2', 'Category4'],
                                              ['Category5', 'Category4'],
                                              ['Category5', 'Category4', 'Category6', 'Category8'],
                                              ['Category1', 'Category2'],
                                              ['Category2', 'Category3']]})

I am trying to find the parent of each category. The rule is: if categoryA appears to the left of categoryB in the same row, categoryA is a parent of categoryB. For pandas_test, I would like this result:

    {'Category1': None, 'Category2': 'Category1', 'Category3': 'Category2', 'Category4': 'Category2', 'Category5': None, 'Category4': 'Category5', 'Category6': 'Category4', 'Category8': 'Category6'}

Here, Category1 has no parent, Category2 has Category1 as its parent, Category3 has Category2, and so on.

So far, I have the following code:

    category_counts = {}
    for index, row in pandas_test.iterrows():
        #categories = row['TAGS'][0].split(',') if row['TAGS'] else []
        categories = row['TAGS']
        for i in range(len(categories)):
            category = categories[i].strip()
            if category not in category_counts:
                category_counts[category] = {'count': 1, 'subcategories': set()}
            else:
                category_counts[category]['count'] += 1
            for j in range(i + 1, len(categories)):
                subcategory = categories[j].strip()
                category_counts[category]['subcategories'].add(subcategory)

    # Analyze category_counts dictionary to determine hierarchy
    hierarchy = {}
    for category, data in category_counts.items():
        subcategories = data['subcategories']
        for subcategory in subcategories:
            if subcategory in category_counts:
                if category not in category_counts[subcategory]['subcategories']:
                    hierarchy[subcategory] = category

    # Apply hierarchy to categories
    for category, parent in hierarchy.items():
        if parent in hierarchy:
            hierarchy[category] = hierarchy[parent]

    print(hierarchy)

But this code returns the following result:

    {'Category3': 'Category1', 'Category2': 'Category1', 'Category4': 'Category5', 'Category6': 'Category5', 'Category8': 'Category5'}

Category3 should have Category2 as its parent. Of course, Category1 is also an ancestor of Category3, because Category1 is the parent of Category2 and Category2 is the parent of Category3, but I want the closest parent (the same goes for Category6 and Category8, which are reported with Category5 as their parent). I also want Category4 to be the child of both Category2 and Category5.

Can someone help me, please?

Answer 1

Score: 2

This is a graph problem. Use a specialized library like networkx to build a directed graph and get the predecessors of each node:

    # pip install networkx
    import networkx as nx
    from itertools import pairwise

    # each pair of neighbouring tags in a row becomes a parent -> child edge
    G = nx.from_edgelist([edge for l in pandas_test['TAGS']
                          for edge in pairwise(l)],
                         create_using=nx.DiGraph)

    out = {n: list(G.predecessors(n)) for n in G}
    print(out)

NB: on Python < 3.10, you can replace pairwise(l) with zip(l, l[1:]).
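To illustrate the note above, here is a minimal sketch of a version-independent import. The fallback function and the sample `tags` list are assumptions made for this illustration (the list is one row of the question's data); the fallback only covers sequences that can be sliced after conversion to a list.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback: yield consecutive overlapping pairs, like zip(l, l[1:])."""
        items = list(iterable)
        return zip(items, items[1:])

# One row of tags taken from the question's data.
tags = ['Category5', 'Category4', 'Category6', 'Category8']

# Each pair is (parent, child): the left neighbour is the parent.
edges = list(pairwise(tags))
print(edges)
# [('Category5', 'Category4'), ('Category4', 'Category6'), ('Category6', 'Category8')]
```

A row with a single tag produces no pairs, so it contributes no edges.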

Output:

    {'Category1': [],
     'Category2': ['Category1'],
     'Category3': ['Category2'],
     'Category4': ['Category2', 'Category5'],
     'Category5': [],
     'Category6': ['Category4'],
     'Category8': ['Category6']}

As a DataFrame:

    df_out = (pd.Series(out).explode()
                .rename_axis('Child').reset_index(name='Parent')
             )

Output:

           Child     Parent
    0  Category1        NaN
    1  Category2  Category1
    2  Category3  Category2
    3  Category4  Category2
    4  Category4  Category5
    5  Category5        NaN
    6  Category6  Category4
    7  Category8  Category6
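The question distinguishes the closest parent from more distant ancestors. As a rough sketch of that distinction, the mapping of direct parents printed earlier can be walked upwards to collect every ancestor of a category. The `ancestors` helper below is a hypothetical name introduced for this illustration, not part of networkx, and the `out` dictionary is copied by hand from the output shown in this answer.

```python
# Direct parents of each category, copied from the answer's printed output.
out = {'Category1': [],
       'Category2': ['Category1'],
       'Category3': ['Category2'],
       'Category4': ['Category2', 'Category5'],
       'Category5': [],
       'Category6': ['Category4'],
       'Category8': ['Category6']}


def ancestors(node, parents):
    """Return every category reachable by repeatedly following parent links."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against cycles in the data
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(out['Category3'])                       # closest parent(s): ['Category2']
print(sorted(ancestors('Category3', out)))    # ['Category1', 'Category2']
```

The direct parents stay in `out`, while `ancestors` gives the transitive view that the question's original code was collapsing into.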

The graph: [image not preserved]

Answer 2

Score: 0

Please try the following method to achieve your output.

    result_dict = {}
    for tags in pandas_test['TAGS']:
        for i, category in enumerate(tags):
            if i > 0:
                result_dict[category] = tags[i - 1]
            else:
                result_dict[category] = None

    for tags in pandas_test['TAGS']:
        for i, category in enumerate(tags):
            if i > 0 and tags[i - 1] not in result_dict:
                result_dict[tags[i - 1]] = None

    print(result_dict)
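One caveat with a plain dictionary is that each key holds a single value, so a later row overwrites what an earlier row stored. With the question's data, Category2 ends up as None because the last row starts with it, and Category4 keeps only one of its two parents. A possible variation, sketched here without pandas or networkx, collects the left neighbours in a set per category. The function name `immediate_parents` and the plain `rows` list (standing in for `pandas_test['TAGS']`) are assumptions made for this illustration.

```python
from collections import defaultdict

# Same rows as pandas_test['TAGS'], written as a plain list so the sketch
# runs without pandas.
rows = [
    ['Category1', 'Category2', 'Category3'],
    ['Category2', 'Category4'],
    ['Category5', 'Category4'],
    ['Category5', 'Category4', 'Category6', 'Category8'],
    ['Category1', 'Category2'],
    ['Category2', 'Category3'],
]


def immediate_parents(rows):
    """Map each category to the sorted list of categories directly to its left."""
    parents = defaultdict(set)
    for tags in rows:
        for i, category in enumerate(tags):
            found = parents[category]  # registers the category even without a parent
            if i > 0:
                found.add(tags[i - 1])
    return {category: sorted(found) for category, found in parents.items()}


result = immediate_parents(rows)
print(result)
```

Categories with an empty list have no parent, and Category4 keeps both Category2 and Category5, which matches the output shown in Answer 1.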

huangapple
  • Published on June 22, 2023, at 16:51:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76530115.html