使用pandas在数据框中创建层次结构,需要使用4列。

huangapple go评论101阅读模式
英文:

Creating hierarchy using 4 columns in dataframe - pandas

问题

以下是翻译好的部分:

  1. # 数据框如下
  2. ID ParentID Filter Text
  3. 0 98 97 NULL AA
  4. 1 99 98 NULL BB
  5. 2 100 99 NULL CC
  6. 3 107 100 1 DD
  7. 4 9999 1231 NULL EE
  8. 5 10000 1334 NULL FF
  9. 6 10001 850 2 GG
  10. 7 850 230 NULL HH
  11. 8 230 121 NULL II
  12. 9 121 96 NULL JJ
  13. 10 96 0 NULL KK
  14. 11 97 0 NULL LL
  15. # 我需要添加一个额外的列 "Hierarchy",如下
  16. ID ParentID Filter Text Hierarchy
  17. 0 98 97 NULL AA
  18. 1 99 98 NULL BB
  19. 2 100 99 NULL CC
  20. 3 107 100 1 DD DD/CC/BB/AA/LL
  21. 4 9999 1231 NULL EE
  22. 5 10000 1334 NULL FF
  23. 6 10001 850 2 GG GG/HH/II/JJ/KK
  24. 7 850 230 NULL HH
  25. 8 230 121 NULL II
  26. 9 121 96 NULL JJ
  27. 10 96 0 NULL KK
  28. 11 97 0 NULL LL
  29. # 我寻找的规则如下:
  30. 1) 只为具有非空 Filter 值的行填充 "Hierarchy" 其余行不需要处理层次结构
  31. 2) 当发现具有非空 Filter 值的行时查找其 ParentID然后在 ID 列中查找此 ParentID一直向上递归直到父级ID0
  32. 3) 尝试使用 itertools但循环太长因为原始数据集很大
  33. 4) 记录集大小约为200k
  34. mozway 提供的以下解决方案似乎有效但对于包含200k条记录的记录集耗费了很长时间是否有办法可以加快解决方案的速度
英文:

Dataframe is below

  1. ID ParentID Filter Text
  2. 0 98 97 NULL AA
  3. 1 99 98 NULL BB
  4. 2 100 99 NULL CC
  5. 3 107 100 1 DD
  6. 4 9999 1231 NULL EE
  7. 5 10000 1334 NULL FF
  8. 6 10001 850 2 GG
  9. 7 850 230 NULL HH
  10. 8 230 121 NULL II
  11. 9 121 96 NULL JJ
  12. 10 96 0 NULL KK
  13. 11 97 0 NULL LL

I need to add an additional column hierarchy like this:

  1. ID ParentID Filter Text Hierarchy
  2. 0 98 97 NULL AA
  3. 1 99 98 NULL BB
  4. 2 100 99 NULL CC
  5. 3 107 100 1 DD DD/CC/BB/AA/LL
  6. 4 9999 1231 NULL EE
  7. 5 10000 1334 NULL FF
  8. 6 10001 850 2 GG GG/HH/II/JJ/KK
  9. 7 850 230 NULL HH
  10. 8 230 121 NULL II
  11. 9 121 96 NULL JJ
  12. 10 96 0 NULL KK
  13. 11 97 0 NULL LL

The rules I am looking at are below:

  1. Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.

  2. When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.

  3. Trying to do this with itertools but the looping is taking too long as the original dataset is huge

4)Recordset size is ~ 200k

The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?

答案1

得分: 3

这是一个图问题,你可以轻松使用 networkx 来解决。

  1. import networkx as nx
  2. m = df['Filter'].notna()
  3. nodes = df.loc[m, 'ID']
  4. mapper = df[m].set_index('ID')['Text']
  5. # 创建图
  6. G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
  7. create_using=nx.DiGraph)
  8. # 找到根节点
  9. roots = {n for n, deg in G.in_degree() if deg == 0}
  10. # {1231, 1334, 0}
  11. # 检索层次结构
  12. df.loc[m, 'Hierarchy'] = [
  13. ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  14. for p in nx.all_simple_paths(G, r, n)])
  15. for n in nodes for r in roots
  16. for p in nx.all_simple_paths(G, r, n)
  17. ]

请注意,如果图形分支,可能会有多个层次结构。在这种情况下,它们将用 ; 分隔。

输出:

  1. ID ParentID Filter Text Hierarchy
  2. 0 98 97 NaN AA NaN
  3. 1 99 98 NaN BB NaN
  4. 2 100 99 NaN CC NaN
  5. 3 107 100 1.0 DD DD/CC/BB/AA/LL
  6. 4 9999 1231 NaN EE NaN
  7. 5 10000 1334 NaN FF NaN
  8. 6 10001 850 2.0 GG GG/HH/II/JJ/KK
  9. 7 850 230 NaN HH NaN
  10. 8 230 121 NaN II NaN
  11. 9 121 96 NaN JJ NaN
  12. 10 96 0 NaN KK NaN
  13. 11 97 0 NaN LL NaN

图形:

使用pandas在数据框中创建层次结构,需要使用4列。

潜在的优化

如果数据集很大,可能的优化是只遍历属于连接组件的根节点。在实际数据集中,你可以尝试看看是否可以提高性能。

  1. import networkx as nx
  2. m = df['Filter'].notna()
  3. nodes = df.loc[m, 'ID']
  4. mapper = df[m].set_index('ID')['Text']
  5. G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
  6. roots = {n for n, deg in G.in_degree() if deg == 0}
  7. # {1231, 1334, 0}
  8. roots_dict = {n: s & roots for s in nx.weakly_connected_components(G) for n in s}
  9. df.loc[m, 'Hierarchy'] = [
  10. ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  11. for p in nx.all_simple_paths(G, r, n)])
  12. for n in nodes for r in roots_dict[n]
  13. for p in nx.all_simple_paths(G, r, n)
  14. ]
英文:

This is a graph problem, which you can easily solve with networkx.

  1. import networkx as nx
  2. m = df['Filter'].notna()
  3. nodes = df.loc[m, 'ID']
  4. mapper = df[m].set_index('ID')['Text']
  5. # create graph
  6. G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
  7. create_using=nx.DiGraph)
  8. # find roots
  9. roots = {n for n, deg in G.in_degree() if deg==0}
  10. # {1231, 1334, 0}
  11. # retrieve hierarchy
  12. df.loc[m, 'Hierarchy'] = [
  13. ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  14. for p in nx.all_simple_paths(G, r, n)])
  15. for n in nodes for r in roots
  16. for p in nx.all_simple_paths(G, r, n)
  17. ]

Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;.

Output:

  1. ID ParentID Filter Text Hierarchy
  2. 0 98 97 NaN AA NaN
  3. 1 99 98 NaN BB NaN
  4. 2 100 99 NaN CC NaN
  5. 3 107 100 1.0 DD DD/CC/BB/AA/LL
  6. 4 9999 1231 NaN EE NaN
  7. 5 10000 1334 NaN FF NaN
  8. 6 10001 850 2.0 GG GG/HH/II/JJ/KK
  9. 7 850 230 NaN HH NaN
  10. 8 230 121 NaN II NaN
  11. 9 121 96 NaN JJ NaN
  12. 10 96 0 NaN KK NaN
  13. 11 97 0 NaN LL NaN

Graph:

使用pandas在数据框中创建层次结构,需要使用4列。

potential optimization

If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You'd have to try in the real dataset if this improves performance.

  1. import networkx as nx
  2. m = df['Filter'].notna()
  3. nodes = df.loc[m, 'ID']
  4. mapper = df[m].set_index('ID')['Text']
  5. G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
  6. roots = {n for n, deg in G.in_degree() if deg==0}
  7. # {1231, 1334, 0}
  8. roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}
  9. df.loc[m, 'Hierarchy'] = [
  10. ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  11. for p in nx.all_simple_paths(G, r, n)])
  12. for n in nodes for r in roots_dict[n]
  13. for p in nx.all_simple_paths(G, r, n)
  14. ]

答案2

得分: 2

这是涉及每个根节点的一次深度优先搜索的解决方案。
最坏时间复杂度为 O(V + FV),其中 V 是列数,F 是要填充层次结构的列数。
这可能比其他解决方案更快,因为它利用了给定图是树的事实,因此从根到任何节点只有一条路径。

  1. # 这是一个递归的深度优先搜索代码,附加逻辑用于存储有趣节点的层次结构
  2. def dfs(graph, stack, interesting, hierarchy):
  3. node = stack[-1]
  4. for child in graph[node]:
  5. stack.append(child)
  6. if child in interesting:
  7. hierarchy[child] = stack[:]
  8. dfs(graph, stack, interesting, hierarchy)
  9. stack.pop()
  10. # 使 'ID' 成为索引
  11. df = df.set_index("ID")
  12. # 找到根节点
  13. roots = df[df["ParentID"] == 0].index
  14. # 有趣节点以查找层次结构
  15. interesting = set(df[df["Filter"].notna()].index)
  16. # 字典以存储 id -> 层次结构映射
  17. hierarchy = {}
  18. # 构建父节点 -> 子节点的邻接列表中的有向图
  19. graph = defaultdict(list)
  20. for node, parent in zip(df.index, df["ParentID"]):
  21. graph[parent].append(node)
  22. # 对每个根运行dfs
  23. for root in roots:
  24. stack = [root]
  25. dfs(graph, stack, interesting, hierarchy)
  26. # 填充层次列
  27. df["Hierarchy"] = ""
  28. for node, path in hierarchy.items():
  29. df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
  30. # 再次使 'ID' 成为列
  31. df = df.reset_index()
  32. # 现在我们完成了!
  33. print(df)

完整代码在 https://pastebin.com/6MFaqZQw

英文:

Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV) where V is the number of columns and F is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.

  1. # this is a recursive dfs code with additional logic to store the hierarchy
  2. # of interesting nodes
  3. def dfs(graph, stack, interesting, hierarchy):
  4. node = stack[-1]
  5. for child in graph[node]:
  6. stack.append(child)
  7. if child in interesting:
  8. hierarchy[child] = stack[:]
  9. dfs(graph, stack, interesting, hierarchy)
  10. stack.pop()
  11. # make 'ID' the index
  12. df = df.set_index("ID")
  13. # find the roots
  14. roots = df[df["ParentID"] == 0].index
  15. # interesting nodes to find the hierarchy for
  16. interesting = set(df[df["Filter"].notna()].index)
  17. # dict to store the id -> hierarchy mapping
  18. hierarchy = {}
  19. # build a directed graph in adjacency list of parent -> children
  20. graph = defaultdict(list)
  21. for node, parent in zip(df.index, df["ParentID"]):
  22. graph[parent].append(node)
  23. # run dfs on each root
  24. for root in roots:
  25. stack = [root]
  26. dfs(graph, stack, interesting, hierarchy)
  27. # populate the hierarchy column
  28. df["Hierarchy"] = ""
  29. for node, path in hierarchy.items():
  30. df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
  31. # make 'ID' a column again
  32. df = df.reset_index()
  33. # now we're done!
  34. print(df)

Full code is in <https://pastebin.com/6MFaqZQw>.

答案3

得分: 1

也许你可以尝试使用字典。不确定,但让我们看看。
创建一个测试数据框:

  1. import pandas as pd
  2. data = {
  3. 'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
  4. 'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
  5. 'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
  6. 'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
  7. }
  8. df = pd.DataFrame(data)

初始化保存数据的位置:

  1. df = pd.DataFrame(data)
  2. df['Hierarchy'] = ""
  3. parent_child_dict = {}

处理字典的主要逻辑:

  1. for index, row in df.iterrows():
  2. current_id = row['ID']
  3. parent_id = row['ParentID']
  4. parent_child_dict[current_id] = parent_id
  5. for index, row in df.iterrows():
  6. hierarchy = []
  7. current_id = row['ID']
  8. while current_id != 0:
  9. parent_id = parent_child_dict.get(current_id)
  10. if parent_id is None:
  11. break
  12. parent_row = df.loc[df['ID'] == parent_id]
  13. if parent_row.empty:
  14. break
  15. parent_text = parent_row['Text'].values[0]
  16. hierarchy.insert(0, parent_text)
  17. current_id = parent_id
  18. hierarchy.append(row['Text'])
  19. df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
  20. print(df)

这是你提供的代码的翻译部分。

英文:

Maybe you can try dictionaries. Not sure, but let's see.
Creating a test dataframe:

  1. import pandas as pd
  2. data = {
  3. &#39;ID&#39;: [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
  4. &#39;ParentID&#39;: [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
  5. &#39;Filter Text&#39;: [None, None, None, &#39;1&#39;, None, None, &#39;2&#39;, None, None, None, None, None],
  6. &#39;Text&#39;: [&#39;AA&#39;, &#39;BB&#39;, &#39;CC&#39;, &#39;DD&#39;, &#39;EE&#39;, &#39;FF&#39;, &#39;GG&#39;, &#39;HH&#39;, &#39;II&#39;, &#39;JJ&#39;, &#39;KK&#39;, &#39;LL&#39;]
  7. }
  8. df = pd.DataFrame(data)

Initialize where you will keep your data:

  1. df = pd.DataFrame(data)
  2. df[&#39;Hierarchy&#39;] = &quot;&quot;
  3. parent_child_dict = {}

The main logic to play with dictionaries

  1. for index, row in df.iterrows():
  2. current_id = row[&#39;ID&#39;]
  3. parent_id = row[&#39;ParentID&#39;]
  4. parent_child_dict[current_id] = parent_id
  5. for index, row in df.iterrows():
  6. hierarchy = []
  7. current_id = row[&#39;ID&#39;]
  8. while current_id != 0:
  9. parent_id = parent_child_dict.get(current_id)
  10. if parent_id is None:
  11. break
  12. parent_row = df.loc[df[&#39;ID&#39;] == parent_id]
  13. if parent_row.empty:
  14. break
  15. parent_text = parent_row[&#39;Text&#39;].values[0]
  16. hierarchy.insert(0, parent_text)
  17. current_id = parent_id
  18. hierarchy.append(row[&#39;Text&#39;])
  19. df.at[index, &#39;Hierarchy&#39;] = &#39;/&#39;.join(hierarchy)
  20. print(df)

huangapple
  • 本文由 发表于 2023年6月2日 13:17:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76387309.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定