How to group by all the objects in a DataFrame column that start with the same prefix?
Question
I have a dataframe that I need to group and sum.
I have used a regex function to find and group all the rows that start with the respective countries.
Suppose I have this dataset:
Hierarchy Cost Type Region
A 10 I X
A 10 I Y
A 10 J X
A 10 J Y
A/B 10 I X
A/B 10 I Y
A/B 10 J X
A/B 10 J Y
A/B/C 10 I X
A/B/C 10 I Y
A/B/C 10 J X
A/B/C 10 J Y
A/B/C 10 K X
A/B/C 10 K Y
Expected output:
Hierarchy Cost Type Region
A 30 I X
A 30 I Y
A 30 J X
A 30 J Y
A 10 K X
A 10 K Y
A/B 20 I X
A/B 20 I Y
A/B 20 J X
A/B 20 J Y
A/B 10 K X
A/B 10 K Y
A/B/C 10 I X
A/B/C 10 I Y
A/B/C 10 J X
A/B/C 10 J Y
A/B/C 10 K X
A/B/C 10 K Y
Rules for grouping:
The only tricky part is "Hierarchy": the cost in the output should be the sum of a hierarchy's own cost and the costs of all its sub-hierarchies. For example:
Hierarchy Cost Type Region
A 30 I X
is the sum of:
A 10 I X
A/B 10 I X
A/B/C 10 I X
Another case is:
Hierarchy Cost Type Region
A 10 K Y
The input data has no record for "Hierarchy" A with "Type" K, but it does have:
A/B/C 10 K Y
Since A/B/C is a sub-hierarchy of A,
A 10 K Y
is added to the output.
I tried:
df.groupby(grp_df['Hierarchy'].str.split('/').str[0], sort=False).sum().reset_index()
But it only gave:
Hierarchy Cost
A 140
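For what it's worth, here is a minimal sketch of why that attempt collapses to a single row. It rebuilds the sample data from the question (the `rows`, `key` and `total` names are only illustrative) and selects `Cost` explicitly so that only the numeric column is summed. Every path starts with `A`, so `.str[0]` yields one key for all 14 rows, and `Type` and `Region` are not part of the grouping at all.

```python
import pandas as pd

# Sample data rebuilt from the question: 14 rows, each with Cost 10.
rows = [(h, 10, t, r)
        for h, types in [("A", "IJ"), ("A/B", "IJ"), ("A/B/C", "IJK")]
        for t in types
        for r in "XY"]
df = pd.DataFrame(rows, columns=["Hierarchy", "Cost", "Type", "Region"])

# Keeping only the first path component maps every row to the same key.
key = df["Hierarchy"].str.split("/").str[0]
print(key.unique())

# With a single key and no Type/Region in the grouping,
# all costs collapse into one sum.
total = df.groupby(key, sort=False)["Cost"].sum()
print(total)
```

To get one row per hierarchy level, type and region, each row has to contribute to every ancestor of its path, and `Type` and `Region` have to be grouping keys as well.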
Answer 1
Score: 2
<details>
<summary>English:</summary>
You can build the expected dataframe by exploding the hierarchy paths.
Following the same idea as your code, you can add the missing rows using `str.split` and `explode`. However, you then have to fix the `Hierarchy` column to get the expected dataframe:
Step 1. Add missing rows
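# Aggregation per column: keep the first value everywhere except 'Cost', which is summed.
# (The dict union operator `|` requires Python 3.9+.)
funcs = {col: 'first' for col in df.columns} | {'Cost': 'sum', 'Node': 'first'}
# Split each path into its nodes, e.g. 'A/B/C' -> ['A', 'B', 'C'].
nodes = df['Hierarchy'].str.split('/')
# One row per (row, node) pair, then aggregate per node, type and region.
out = (df.assign(Node=nodes).explode('Node')
         .groupby(['Node', 'Type', 'Region'], as_index=False)
         .agg(funcs))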
Step 2. Fix the hierarchy
paths = out['Hierarchy'].str.split('/', expand=True)
paths *= paths.eq(out['Node'], axis=0).replace(False, None).bfill(axis=1)
paths = paths.melt(ignore_index=False).dropna().groupby(level=0)['value'].agg('/'.join)
out['Hierarchy'] = paths
Output:
Final result
>>> out
Hierarchy Cost Type Region Node
0 A 30 I X A
1 A 30 I Y A
2 A 30 J X A
3 A 30 J Y A
4 A 10 K X A
5 A 10 K Y A
6 A/B 20 I X B
7 A/B 20 I Y B
8 A/B 20 J X B
9 A/B 20 J Y B
10 A/B 10 K X B
11 A/B 10 K Y B
12 A/B/C 10 I X C
13 A/B/C 10 I Y C
14 A/B/C 10 J X C
15 A/B/C 10 J Y C
16 A/B/C 10 K X C
17 A/B/C 10 K Y C
Before fix (before step 2)
>>> out
Hierarchy Cost Type Region Node # <- This column gives the depth of the path
0 A 30 I X A
1 A 30 I Y A
2 A 30 J X A
3 A 30 J Y A
4 A/B/C 10 K X A # Should be fixed, depth A
5 A/B/C 10 K Y A # Should be fixed, depth A
6 A/B 20 I X B
7 A/B 20 I Y B
8 A/B 20 J X B
9 A/B 20 J Y B
10 A/B/C 10 K X B # Should be fixed, depth B
11 A/B/C 10 K Y B # Should be fixed, depth B
12 A/B/C 10 I X C
13 A/B/C 10 I Y C
14 A/B/C 10 J X C
15 A/B/C 10 J Y C
16 A/B/C 10 K X C
17 A/B/C 10 K Y C
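The fix in Step 2 amounts to cutting each path off at its own `Node`. A simpler way to express that idea, as a sketch rather than the code used above, is a small helper; `truncate_at` and the four-row `before` frame are hypothetical names for illustration, and the helper assumes a node name appears only once within a path.

```python
import pandas as pd

# Four rows taken from the "before fix" table (rows 0, 4, 10 and 16).
before = pd.DataFrame({
    "Hierarchy": ["A", "A/B/C", "A/B/C", "A/B/C"],
    "Cost": [30, 10, 10, 10],
    "Type": ["I", "K", "K", "K"],
    "Region": ["X", "X", "X", "X"],
    "Node": ["A", "A", "B", "C"],
})

def truncate_at(path, node):
    """Keep the path up to and including `node`, e.g. ('A/B/C', 'B') -> 'A/B'."""
    parts = path.split("/")
    return "/".join(parts[: parts.index(node) + 1])

# Apply the helper row by row, pairing each path with its node.
before["Hierarchy"] = [
    truncate_at(path, node)
    for path, node in zip(before["Hierarchy"], before["Node"])
]
print(before)
```

This trades the vectorised mask for a plain Python loop, so it is easier to read but not necessarily faster on large frames.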
**Note**: the code above is probably a bit slow, but it should work. You can also use `networkx` to build the graph with the `Cost` column as weight, then find all the simple paths from roots to leaves and collect the weights.
</details>
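A possible alternative that avoids the separate fix-up step, offered as a sketch and not as the answerer's method: expand each path into all of its ancestor prefixes before exploding, so that the group key is already the full hierarchy string. The `ancestors` helper is a hypothetical name, and the sample data is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question: 14 rows, each with Cost 10.
rows = [(h, 10, t, r)
        for h, types in [("A", "IJ"), ("A/B", "IJ"), ("A/B/C", "IJK")]
        for t in types
        for r in "XY"]
df = pd.DataFrame(rows, columns=["Hierarchy", "Cost", "Type", "Region"])

def ancestors(path):
    """Return every prefix of a '/'-separated path, including the path itself."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

out = (df.assign(Hierarchy=df["Hierarchy"].map(ancestors))  # 'A/B' -> ['A', 'A/B']
         .explode("Hierarchy")                              # one row per ancestor
         .groupby(["Hierarchy", "Type", "Region"], as_index=False)["Cost"]
         .sum())

# Restore the column order used in the question.
out = out[["Hierarchy", "Cost", "Type", "Region"]]
print(out)
```

Because each row is copied once per ancestor, the intermediate frame grows with the depth of the hierarchy, which may matter for deep trees.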
# Answer 2
**Score**: 0
You have to split the hierarchy column into letters, something like this:
b = ''
for i in 'ABC':
    b += i
    df[b] = 0
    df.loc[df[df['Hierarchy'].str.contains(i)].index, b] = b
Then you can group by the created columns like this:
df.groupby(by=['A','AB','ABC','Type','Region']).sum()
This code worked for me.