使用pandas在数据框中创建层次结构,需要使用4列。

huangapple go评论78阅读模式
英文:

Creating hierarchy using 4 columns in dataframe - pandas

问题

以下是翻译好的部分:

# 数据框如下
    ID        ParentID   Filter Text
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

# 我需要添加一个额外的列 "Hierarchy",如下
    ID        ParentID   Filter Text   Hierarchy
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD      DD/CC/BB/AA/LL
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG      GG/HH/II/JJ/KK
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

# 我寻找的规则如下:
1) 只为具有非空 Filter 值的行填充 "Hierarchy"其余行不需要处理层次结构
2) 当发现具有非空 Filter 值的行时查找其 ParentID然后在 ID 列中查找此 ParentID一直向上递归直到父级ID为0
3) 尝试使用 itertools但循环太长因为原始数据集很大
4) 记录集大小约为200k

由 mozway 提供的以下解决方案似乎有效但对于包含200k条记录的记录集耗费了很长时间是否有办法可以加快解决方案的速度
英文:

Dataframe is below

    ID        ParentID   Filter Text
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

I need to add an additional column hierarchy like this:

    ID        ParentID   Filter Text   Hierarchy
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD      DD/CC/BB/AA/LL
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG      GG/HH/II/JJ/KK
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

The rules I am looking at are below:

  1. Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.

  2. When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.

  3. Trying to do this with itertools but the looping is taking too long as the original dataset is huge

4)Recordset size is ~ 200k

The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?

答案1

得分: 3

这是一个图问题,你可以轻松使用 networkx 来解决。

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

mapper = df[m].set_index('ID')['Text']

# 创建图
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
                            create_using=nx.DiGraph)

# 找到根节点
roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}

# 检索层次结构
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots
    for p in nx.all_simple_paths(G, r, n)
]

请注意,如果图形分支,可能会有多个层次结构。在这种情况下,它们将用 ; 分隔。

输出:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN

图形:

使用pandas在数据框中创建层次结构,需要使用4列。

潜在的优化

如果数据集很大,可能的优化是只遍历属于连接组件的根节点。在实际数据集中,你可以尝试看看是否可以提高性能。

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

mapper = df[m].set_index('ID')['Text']

G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)

roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}

roots_dict = {n: s & roots for s in nx.weakly_connected_components(G) for n in s}

df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots_dict[n]
    for p in nx.all_simple_paths(G, r, n)
]
英文:

This is a graph problem, which you can easily solve with networkx.

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

mapper = df[m].set_index('ID')['Text']

# create graph
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
                            create_using=nx.DiGraph)

# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

# retrieve hierarchy
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
                        for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots
    for p in nx.all_simple_paths(G, r, n)
]

Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;.

Output:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN

Graph:

使用pandas在数据框中创建层次结构,需要使用4列。

potential optimization

If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You'd have to try in the real dataset if this improves performance.

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

mapper = df[m].set_index('ID')['Text']

G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)

roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}

df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
                        for p in nx.all_simple_paths(G, r, n)])
    for n in nodes for r in roots_dict[n]
    for p in nx.all_simple_paths(G, r, n)
]

答案2

得分: 2

这是涉及每个根节点的一次深度优先搜索的解决方案。
最坏时间复杂度为 O(V + FV),其中 V 是列数,F 是要填充层次结构的列数。
这可能比其他解决方案更快,因为它利用了给定图是树的事实,因此从根到任何节点只有一条路径。

# 这是一个递归的深度优先搜索代码,附加逻辑用于存储有趣节点的层次结构
def dfs(graph, stack, interesting, hierarchy):
    node = stack[-1]
    for child in graph[node]:
        stack.append(child)
        if child in interesting:
            hierarchy[child] = stack[:]
        dfs(graph, stack, interesting, hierarchy)
    stack.pop()


# 使 'ID' 成为索引
df = df.set_index("ID")

# 找到根节点
roots = df[df["ParentID"] == 0].index
# 有趣节点以查找层次结构
interesting = set(df[df["Filter"].notna()].index)
# 字典以存储 id -> 层次结构映射
hierarchy = {}

# 构建父节点 -> 子节点的邻接列表中的有向图
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
    graph[parent].append(node)

# 对每个根运行dfs
for root in roots:
    stack = [root]
    dfs(graph, stack, interesting, hierarchy)

# 填充层次列
df["Hierarchy"] = ""
for node, path in hierarchy.items():
    df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])

# 再次使 'ID' 成为列
df = df.reset_index()

# 现在我们完成了!
print(df)

完整代码在 https://pastebin.com/6MFaqZQw

英文:

Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV) where V is the number of columns and F is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.

# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
    node = stack[-1]
    for child in graph[node]:
        stack.append(child)
        if child in interesting:
            hierarchy[child] = stack[:]
        dfs(graph, stack, interesting, hierarchy)
    stack.pop()


# make 'ID' the index
df = df.set_index("ID")

# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}

# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
    graph[parent].append(node)

# run dfs on each root
for root in roots:
    stack = [root]
    dfs(graph, stack, interesting, hierarchy)

# populate the hierarchy column
df["Hierarchy"] = ""
for node, path in hierarchy.items():
    df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])

# make 'ID' a column again
df = df.reset_index()

# now we're done!
print(df)

Full code is in <https://pastebin.com/6MFaqZQw>.

答案3

得分: 1

也许你可以尝试使用字典。不确定,但让我们看看。
创建一个测试数据框:

import pandas as pd

data = {
    'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
    'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}

df = pd.DataFrame(data)

初始化保存数据的位置:

df = pd.DataFrame(data)

df['Hierarchy'] = ""
parent_child_dict = {}

处理字典的主要逻辑:

for index, row in df.iterrows():
    current_id = row['ID']
    parent_id = row['ParentID']

    parent_child_dict[current_id] = parent_id

for index, row in df.iterrows():
    hierarchy = []
    current_id = row['ID']

    while current_id != 0:
        parent_id = parent_child_dict.get(current_id)

        if parent_id is None:
            break

        parent_row = df.loc[df['ID'] == parent_id]

        if parent_row.empty:
            break

        parent_text = parent_row['Text'].values[0]
        hierarchy.insert(0, parent_text)
        current_id = parent_id

    hierarchy.append(row['Text'])

    df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)

这是你提供的代码的翻译部分。

英文:

Maybe you can try dictionaries. Not sure, but let's see.
Creating a test dataframe:

import pandas as pd

data = {
    &#39;ID&#39;: [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    &#39;ParentID&#39;: [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    &#39;Filter Text&#39;: [None, None, None, &#39;1&#39;, None, None, &#39;2&#39;, None, None, None, None, None],
    &#39;Text&#39;: [&#39;AA&#39;, &#39;BB&#39;, &#39;CC&#39;, &#39;DD&#39;, &#39;EE&#39;, &#39;FF&#39;, &#39;GG&#39;, &#39;HH&#39;, &#39;II&#39;, &#39;JJ&#39;, &#39;KK&#39;, &#39;LL&#39;]
}

df = pd.DataFrame(data)

Initialize where you will keep your data:

df = pd.DataFrame(data)

df[&#39;Hierarchy&#39;] = &quot;&quot;

parent_child_dict = {}

The main logic to play with dictionaries

for index, row in df.iterrows():
    current_id = row[&#39;ID&#39;]
    parent_id = row[&#39;ParentID&#39;]
    
    parent_child_dict[current_id] = parent_id

for index, row in df.iterrows():
    hierarchy = []
    current_id = row[&#39;ID&#39;]
    
    while current_id != 0:
        parent_id = parent_child_dict.get(current_id)
        
        if parent_id is None:
            break
        
        parent_row = df.loc[df[&#39;ID&#39;] == parent_id]
        
        if parent_row.empty:
            break
        
        parent_text = parent_row[&#39;Text&#39;].values[0]
        hierarchy.insert(0, parent_text)
        current_id = parent_id
    
    hierarchy.append(row[&#39;Text&#39;])
    
    df.at[index, &#39;Hierarchy&#39;] = &#39;/&#39;.join(hierarchy)
print(df)

huangapple
  • 本文由 发表于 2023年6月2日 13:17:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76387309.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定