英文:
Creating hierarchy using 4 columns in dataframe - pandas
问题
以下是翻译好的部分:
# 数据框如下
ID ParentID Filter Text
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
# 我需要添加一个额外的列 "Hierarchy",如下
ID ParentID Filter Text Hierarchy
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD DD/CC/BB/AA/LL
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG GG/HH/II/JJ/KK
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
# 我寻找的规则如下:
1) 只为具有非空 Filter 值的行填充 "Hierarchy" 列,其余行不需要处理层次结构。
2) 当发现具有非空 Filter 值的行时,查找其 ParentID,然后在 ID 列中查找此 ParentID。一直向上递归,直到父级ID为0。
3) 尝试使用 itertools,但循环太长,因为原始数据集很大。
4) 记录集大小约为200k。
由 mozway 提供的以下解决方案似乎有效,但对于包含200k条记录的记录集,耗费了很长时间。是否有办法可以加快解决方案的速度?
英文:
Dataframe is below
ID ParentID Filter Text
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
I need to add an additional column hierarchy like this:
ID ParentID Filter Text Hierarchy
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD DD/CC/BB/AA/LL
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG GG/HH/II/JJ/KK
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
The rules I am looking at are below:
-
Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.
-
When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.
-
Trying to do this with itertools but the looping is taking too long as the original dataset is huge
4)Recordset size is ~ 200k
The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?
答案1
得分: 3
这是一个图问题,你可以轻松使用 networkx
来解决。
- 使用
nx.from_pandas_edgelist
创建一个有向图。 - 使用列表推导和
in_degree
找到根节点。 - 使用
all_simple_paths
和映射系列获取选定节点和根节点之间的路径。 - 分配给你的 DataFrame
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
# 创建图
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
create_using=nx.DiGraph)
# 找到根节点
roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}
# 检索层次结构
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots
for p in nx.all_simple_paths(G, r, n)
]
请注意,如果图形分支,可能会有多个层次结构。在这种情况下,它们将用 ;
分隔。
输出:
ID ParentID Filter Text Hierarchy
0 98 97 NaN AA NaN
1 99 98 NaN BB NaN
2 100 99 NaN CC NaN
3 107 100 1.0 DD DD/CC/BB/AA/LL
4 9999 1231 NaN EE NaN
5 10000 1334 NaN FF NaN
6 10001 850 2.0 GG GG/HH/II/JJ/KK
7 850 230 NaN HH NaN
8 230 121 NaN II NaN
9 121 96 NaN JJ NaN
10 96 0 NaN KK NaN
11 97 0 NaN LL NaN
图形:
潜在的优化
如果数据集很大,可能的优化是只遍历属于连接组件的根节点。在实际数据集中,你可以尝试看看是否可以提高性能。
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg == 0}
# {1231, 1334, 0}
roots_dict = {n: s & roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots_dict[n]
for p in nx.all_simple_paths(G, r, n)
]
英文:
This is a graph problem, which you can easily solve with networkx
.
- create a directed graph with
nx.from_pandas_edgelist
- find the roots with a list comprehension and
in_degree
- get the paths between the selected nodes and the roots using
all_simple_paths
and a mapping Series - assign to you DataFrame
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
# create graph
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
create_using=nx.DiGraph)
# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
# retrieve hierarchy
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots
for p in nx.all_simple_paths(G, r, n)
]
Note that there could be several hierarchies if the graph is branched. In this case, this would return all of them separated by ;
.
Output:
ID ParentID Filter Text Hierarchy
0 98 97 NaN AA NaN
1 99 98 NaN BB NaN
2 100 99 NaN CC NaN
3 107 100 1.0 DD DD/CC/BB/AA/LL
4 9999 1231 NaN EE NaN
5 10000 1334 NaN FF NaN
6 10001 850 2.0 GG GG/HH/II/JJ/KK
7 850 230 NaN HH NaN
8 230 121 NaN II NaN
9 121 96 NaN JJ NaN
10 96 0 NaN KK NaN
11 97 0 NaN LL NaN
Graph:
potential optimization
If the dataset is huge, a potential optimization might be to only iterate over the roots that are part of the connected components. You'd have to try in the real dataset if this improves performance.
import networkx as nx
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']
mapper = df[m].set_index('ID')['Text']
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}
roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}
df.loc[m, 'Hierarchy'] = [
';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
for p in nx.all_simple_paths(G, r, n)])
for n in nodes for r in roots_dict[n]
for p in nx.all_simple_paths(G, r, n)
]
答案2
得分: 2
这是涉及每个根节点的一次深度优先搜索的解决方案。
最坏时间复杂度为 O(V + FV)
,其中 V
是列数,F
是要填充层次结构的列数。
这可能比其他解决方案更快,因为它利用了给定图是树的事实,因此从根到任何节点只有一条路径。
# 这是一个递归的深度优先搜索代码,附加逻辑用于存储有趣节点的层次结构
def dfs(graph, stack, interesting, hierarchy):
node = stack[-1]
for child in graph[node]:
stack.append(child)
if child in interesting:
hierarchy[child] = stack[:]
dfs(graph, stack, interesting, hierarchy)
stack.pop()
# 使 'ID' 成为索引
df = df.set_index("ID")
# 找到根节点
roots = df[df["ParentID"] == 0].index
# 有趣节点以查找层次结构
interesting = set(df[df["Filter"].notna()].index)
# 字典以存储 id -> 层次结构映射
hierarchy = {}
# 构建父节点 -> 子节点的邻接列表中的有向图
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
graph[parent].append(node)
# 对每个根运行dfs
for root in roots:
stack = [root]
dfs(graph, stack, interesting, hierarchy)
# 填充层次列
df["Hierarchy"] = ""
for node, path in hierarchy.items():
df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# 再次使 'ID' 成为列
df = df.reset_index()
# 现在我们完成了!
print(df)
完整代码在 https://pastebin.com/6MFaqZQw。
英文:
Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV)
where V
is the number of columns and F
is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.
# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
node = stack[-1]
for child in graph[node]:
stack.append(child)
if child in interesting:
hierarchy[child] = stack[:]
dfs(graph, stack, interesting, hierarchy)
stack.pop()
# make 'ID' the index
df = df.set_index("ID")
# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}
# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
graph[parent].append(node)
# run dfs on each root
for root in roots:
stack = [root]
dfs(graph, stack, interesting, hierarchy)
# populate the hierarchy column
df["Hierarchy"] = ""
for node, path in hierarchy.items():
df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# make 'ID' a column again
df = df.reset_index()
# now we're done!
print(df)
Full code is in <https://pastebin.com/6MFaqZQw>.
答案3
得分: 1
也许你可以尝试使用字典。不确定,但让我们看看。
创建一个测试数据框:
import pandas as pd
data = {
'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}
df = pd.DataFrame(data)
初始化保存数据的位置:
df = pd.DataFrame(data)
df['Hierarchy'] = ""
parent_child_dict = {}
处理字典的主要逻辑:
for index, row in df.iterrows():
current_id = row['ID']
parent_id = row['ParentID']
parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
hierarchy = []
current_id = row['ID']
while current_id != 0:
parent_id = parent_child_dict.get(current_id)
if parent_id is None:
break
parent_row = df.loc[df['ID'] == parent_id]
if parent_row.empty:
break
parent_text = parent_row['Text'].values[0]
hierarchy.insert(0, parent_text)
current_id = parent_id
hierarchy.append(row['Text'])
df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)
这是你提供的代码的翻译部分。
英文:
Maybe you can try dictionaries. Not sure, but let's see.
Creating a test dataframe:
import pandas as pd
data = {
'ID': [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
'ParentID': [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
'Filter Text': [None, None, None, '1', None, None, '2', None, None, None, None, None],
'Text': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG', 'HH', 'II', 'JJ', 'KK', 'LL']
}
df = pd.DataFrame(data)
Initialize where you will keep your data:
df = pd.DataFrame(data)
df['Hierarchy'] = ""
parent_child_dict = {}
The main logic to play with dictionaries
for index, row in df.iterrows():
current_id = row['ID']
parent_id = row['ParentID']
parent_child_dict[current_id] = parent_id
for index, row in df.iterrows():
hierarchy = []
current_id = row['ID']
while current_id != 0:
parent_id = parent_child_dict.get(current_id)
if parent_id is None:
break
parent_row = df.loc[df['ID'] == parent_id]
if parent_row.empty:
break
parent_text = parent_row['Text'].values[0]
hierarchy.insert(0, parent_text)
current_id = parent_id
hierarchy.append(row['Text'])
df.at[index, 'Hierarchy'] = '/'.join(hierarchy)
print(df)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论