2023年6月2日 13:17:10go评论101阅读模式

英文:

Creating hierarchy using 4 columns in dataframe - pandas

问题

以下是翻译好的部分：

# 数据框如下
    ID        ParentID   Filter Text
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL
# 我需要添加一个额外的列 "Hierarchy"，如下
    ID        ParentID   Filter Text   Hierarchy
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD      DD/CC/BB/AA/LL
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG      GG/HH/II/JJ/KK
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL
# 我寻找的规则如下：
1) 只为具有非空 Filter 值的行填充 "Hierarchy" 列，其余行不需要处理层次结构。
2) 当发现具有非空 Filter 值的行时，查找其 ParentID，然后在 ID 列中查找此 ParentID。一直向上递归，直到父级ID为0。
3) 尝试使用 itertools，但循环太长，因为原始数据集很大。
4) 记录集大小约为200k。
由 mozway 提供的以下解决方案似乎有效，但对于包含200k条记录的记录集，耗费了很长时间。是否有办法可以加快解决方案的速度？

英文:

Dataframe is below

    ID        ParentID   Filter Text
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

I need to add an additional column hierarchy like this:

    ID        ParentID   Filter Text   Hierarchy
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD      DD/CC/BB/AA/LL
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG      GG/HH/II/JJ/KK
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

The rules I am looking at are below:

Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.
When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.
Trying to do this with itertools but the looping is taking too long as the original dataset is huge

4)Recordset size is ~ 200k

The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?

答案1

得分: 3

这是一个图问题，你可以轻松使用 networkx 来解决。

使用 nx.from_pandas_edgelist 创建一个有向图。
使用列表推导和 in_degree 找到根节点。
使用 all_simple_paths 和映射系列获取选定节点和根节点之间的路径。
分配给你的 DataFrame

import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
# 创建图
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
                            create_using=nx.DiGraph)
# 找到根节点
roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}
# 检索层次结构
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots
    for p in nx.all_simple_paths(G, r, n)
]

请注意，如果图形分支，可能会有多个层次结构。在这种情况下，它们将用 ; 分隔。

输出:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN

图形:

潜在的优化

如果数据集很大，可能的优化是只遍历属于连接组件的根节点。在实际数据集中，你可以尝试看看是否可以提高性能。

import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}
roots_dict = {n: s & roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots_dict[n]
    for p in nx.all_simple_paths(G, r, n)
]

英文:

This is a graph problem, which you can easily solve with networkx.

create a directed graph with nx.from_pandas_edgelist
find the roots with a list comprehension and in_degree
get the paths between the selected nodes and the roots using all_simple_paths and a mapping Series
assign to you DataFrame

import networkx as nx
m = df[&#39;Filter&#39;].notna()
nodes = df.loc[m, &#39;ID&#39;]
mapper = df[m].set_index(&#39;ID&#39;)[&#39;Text&#39;]
# create graph
G = nx.from_pandas_edgelist(df, source=&#39;ParentID&#39;, target=&#39;ID&#39;,
                            create_using=nx.DiGraph)
# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
# retrieve hierarchy
df.loc[m, &#39;Hierarchy&#39;] = [
    &#39;;&#39;.join([&#39;/&#39;.join([mapper.get(x) for x in p[:0:-1]])
                        for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots
    for p in nx.all_simple_paths(G, r, n)
]

Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;.

Output:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN

Graph:

potential optimization

If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You'd have to try in the real dataset if this improves performance.

import networkx as nx
m = df[&#39;Filter&#39;].notna()
nodes = df.loc[m, &#39;ID&#39;]
mapper = df[m].set_index(&#39;ID&#39;)[&#39;Text&#39;]
G = nx.from_pandas_edgelist(df, source=&#39;ParentID&#39;, target=&#39;ID&#39;, create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
roots_dict = {n: s&amp;roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, &#39;Hierarchy&#39;] = [
    &#39;;&#39;.join([&#39;/&#39;.join([mapper.get(x) for x in p[:0:-1]])
                        for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots_dict[n]
    for p in nx.all_simple_paths(G, r, n)
]

答案2

得分: 2

这是涉及每个根节点的一次深度优先搜索的解决方案。
最坏时间复杂度为 O(V + FV)，其中 V 是列数，F 是要填充层次结构的列数。
这可能比其他解决方案更快，因为它利用了给定图是树的事实，因此从根到任何节点只有一条路径。

# 这是一个递归的深度优先搜索代码，附加逻辑用于存储有趣节点的层次结构
def dfs(graph, stack, interesting, hierarchy):
    node = stack[-1]
    for child in graph[node]:
        stack.append(child)
        if child in interesting:
            hierarchy[child] = stack[:]
        dfs(graph, stack, interesting, hierarchy)
    stack.pop()
# 使 'ID' 成为索引
df = df.set_index("ID")
# 找到根节点
roots = df[df["ParentID"] == 0].index
# 有趣节点以查找层次结构
interesting = set(df[df["Filter"].notna()].index)
# 字典以存储 id -> 层次结构映射
hierarchy = {}
# 构建父节点 -> 子节点的邻接列表中的有向图
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
    graph[parent].append(node)
# 对每个根运行dfs
for root in roots:
    stack = [root]
    dfs(graph, stack, interesting, hierarchy)
# 填充层次列
df["Hierarchy"] = ""
for node, path in hierarchy.items():
    df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# 再次使 'ID' 成为列
df = df.reset_index()
# 现在我们完成了！
print(df)

完整代码在 https://pastebin.com/6MFaqZQw。

英文:

Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV) where V is the number of columns and F is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.

# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
    node = stack[-1]
    for child in graph[node]:
        stack.append(child)
        if child in interesting:
            hierarchy[child] = stack[:]
        dfs(graph, stack, interesting, hierarchy)
    stack.pop()
# make &#39;ID&#39; the index
df = df.set_index(&quot;ID&quot;)
# find the roots
roots = df[df[&quot;ParentID&quot;] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df[&quot;Filter&quot;].notna()].index)
# dict to store the id -&gt; hierarchy mapping
hierarchy = {}
# build a directed graph in adjacency list of parent -&gt; children
graph = defaultdict(list)
for node, parent in zip(df.index, df[&quot;ParentID&quot;]):
    graph[parent].append(node)
# run dfs on each root
for root in roots:
    stack = [root]
    dfs(graph, stack, interesting, hierarchy)
# populate the hierarchy column
df[&quot;Hierarchy&quot;] = &quot;&quot;
for node, path in hierarchy.items():
    df.loc[node, &quot;Hierarchy&quot;] = &quot;/&quot;.join(df.loc[path, &quot;Text&quot;])
# make &#39;ID&#39; a column again
df = df.reset_index()
# now we&#39;re done!
print(df)

Full code is in <https://pastebin.com/6MFaqZQw>.

答案3

得分: 1

也许你可以尝试使用字典。不确定，但让我们看看。
创建一个测试数据框：

import pandas as pd
data = {
    'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
    'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}
df = pd.DataFrame(data)

初始化保存数据的位置：

df = pd.DataFrame(data)
df['Hierarchy'] = ""
parent_child_dict = {}

处理字典的主要逻辑：

for index, row in df.iterrows():
    current_id = row['ID']
    parent_id = row['ParentID']
    parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
    hierarchy = []
    current_id = row['ID']
    while current_id != 0:
        parent_id = parent_child_dict.get(current_id)
        if parent_id is None:
            break
        parent_row = df.loc[df['ID'] == parent_id]
        if parent_row.empty:
            break
        parent_text = parent_row['Text'].values[0]
        hierarchy.insert(0, parent_text)
        current_id = parent_id
    hierarchy.append(row['Text'])
    df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)

这是你提供的代码的翻译部分。

英文:

Maybe you can try dictionaries. Not sure, but let's see.
Creating a test dataframe:

import pandas as pd
data = {
    &#39;ID&#39;: [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    &#39;ParentID&#39;: [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    &#39;Filter Text&#39;: [None, None, None, &#39;1&#39;, None, None, &#39;2&#39;, None, None, None, None, None],
    &#39;Text&#39;: [&#39;AA&#39;, &#39;BB&#39;, &#39;CC&#39;, &#39;DD&#39;, &#39;EE&#39;, &#39;FF&#39;, &#39;GG&#39;, &#39;HH&#39;, &#39;II&#39;, &#39;JJ&#39;, &#39;KK&#39;, &#39;LL&#39;]
}
df = pd.DataFrame(data)

Initialize where you will keep your data:

df = pd.DataFrame(data)
df[&#39;Hierarchy&#39;] = &quot;&quot;
parent_child_dict = {}

The main logic to play with dictionaries

for index, row in df.iterrows():
    current_id = row[&#39;ID&#39;]
    parent_id = row[&#39;ParentID&#39;]
    
    parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
    hierarchy = []
    current_id = row[&#39;ID&#39;]
    
    while current_id != 0:
        parent_id = parent_child_dict.get(current_id)
        
        if parent_id is None:
            break
        
        parent_row = df.loc[df[&#39;ID&#39;] == parent_id]
        
        if parent_row.empty:
            break
        
        parent_text = parent_row[&#39;Text&#39;].values[0]
        hierarchy.insert(0, parent_text)
        current_id = parent_id
    
    hierarchy.append(row[&#39;Text&#39;])
    
    df.at[index, &#39;Hierarchy&#39;] = &#39;/&#39;.join(hierarchy)
print(df)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用pandas在数据框中创建层次结构，需要使用4列。

问题

答案1

潜在的优化

potential optimization

答案2

答案3

为什么我的字典数据在使用 openpyxl 制作的柱状图上显示不正确？

Pytestfs写入然后读取不返回预期值。

在Python或NumPy中已经有一些方法可以确定数字的格式吗？

是否可以链接多个Discord模态框？ [discord.py]

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。