How to group by all values in a dataframe column that start with the same prefix?

Question


I have a dataframe in which I need to group the rows and sum the data.

I have used a regex function to find and group all the rows whose values start with the respective countries.

Suppose I have this dataset:

    Hierarchy  Cost  Type  Region
    A          10    I     X
    A          10    I     Y
    A          10    J     X
    A          10    J     Y
    A/B        10    I     X
    A/B        10    I     Y
    A/B        10    J     X
    A/B        10    J     Y
    A/B/C      10    I     X
    A/B/C      10    I     Y
    A/B/C      10    J     X
    A/B/C      10    J     Y
    A/B/C      10    K     X
    A/B/C      10    K     Y

Expected output:

    Hierarchy  Cost  Type  Region
    A          30    I     X
    A          30    I     Y
    A          30    J     X
    A          30    J     Y
    A          10    K     X
    A          10    K     Y
    A/B        20    I     X
    A/B        20    I     Y
    A/B        20    J     X
    A/B        20    J     Y
    A/B        10    K     X
    A/B        10    K     Y
    A/B/C      10    I     X
    A/B/C      10    I     Y
    A/B/C      10    J     X
    A/B/C      10    J     Y
    A/B/C      10    K     X
    A/B/C      10    K     Y

Rules for grouping:
The only tricky part is the "Hierarchy": the cost in the output should be the sum of a hierarchy's own cost and the costs of all its sub-hierarchies. For example:

    Hierarchy  Cost  Type  Region
    A          30    I     X

is the sum of:

    A          10    I     X
    A/B        10    I     X
    A/B/C      10    I     X

Another case is:

    Hierarchy  Cost  Type  Region
    A          10    K     Y

The input data has no record for "Hierarchy" A with "Type" K, but it does have:

    A/B/C      10    K     Y

Since A/B/C is a sub-hierarchy of A,

    A          10    K     Y

is added to the output.
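The roll-up rule described above can be sketched directly in pandas. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper `ancestors` and the variable `rolled` are hypothetical names introduced for illustration, and the sample frame is rebuilt from the table in the question. Each row is credited to its own path and to every parent path, and the result is then summed per path, type and region.

```python
import pandas as pd

# Rebuild the sample input from the question: every row costs 10.
rows = (
    [("A", t, r) for t in "IJ" for r in "XY"]
    + [("A/B", t, r) for t in "IJ" for r in "XY"]
    + [("A/B/C", t, r) for t in "IJK" for r in "XY"]
)
df = pd.DataFrame(rows, columns=["Hierarchy", "Type", "Region"])
df.insert(1, "Cost", 10)


def ancestors(path):
    """Hypothetical helper: 'A/B/C' -> ['A', 'A/B', 'A/B/C']."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


rolled = (
    # Replace each path by the list of paths it contributes to ...
    df.assign(Hierarchy=df["Hierarchy"].map(ancestors))
    # ... turn that list into one row per contributing path ...
    .explode("Hierarchy")
    # ... and sum the cost per (path, type, region).
    .groupby(["Hierarchy", "Type", "Region"], as_index=False)["Cost"]
    .sum()
)
print(rolled)
```

Because the grouping key is the full path prefix rather than a single node name, two nodes with the same name under different parents would stay separate in this sketch.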

I tried:

    df.groupby(grp_df['Hierarchy'].str.split('/').str[0], sort=False).sum().reset_index()

But it only gave:

    Hierarchy  Cost
    A          140
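A likely reason for the single row is that `.str.split('/').str[0]` keeps only the first component of each path, and in this data that component is `A` for every row, so all 14 rows (14 × 10 = 140) fall into one group. The sketch below illustrates this on a reduced, made-up frame; the names `root` and `by_root` are assumptions for illustration.

```python
import pandas as pd

# Reduced sample: four rows, every path starts with "A".
df = pd.DataFrame({
    "Hierarchy": ["A", "A/B", "A/B/C", "A/B/C"],
    "Cost": [10, 10, 10, 10],
    "Type": ["I", "I", "I", "K"],
    "Region": ["X", "X", "X", "X"],
})

# The first path component is the root, so the key is "A" for every row.
root = df["Hierarchy"].str.split("/").str[0]
print(root.unique())

# Grouping on that key alone therefore yields one row per root,
# and Type and Region no longer separate the sums.
by_root = df.groupby(root, sort=False)["Cost"].sum().reset_index()
print(by_root)
```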

Answer 1

Score: 2


<details>
<summary>English:</summary>

You can build the expected dataframe by exploding the hierarchy. Following the same idea as your code, you can add the missing rows using `str.split` and `explode`. However, you then have to fix the `Hierarchy` column to get the expected dataframe:

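Step 1. Add missing rows

    # Keep the first value of every column, except Cost, which is summed
    # (the `|` dict merge needs Python 3.9+)
    funcs = {col: 'first' for col in df.columns} | {'Cost': 'sum', 'Node': 'first'}
    # Split each path into its nodes: 'A/B/C' -> ['A', 'B', 'C']
    nodes = df['Hierarchy'].str.split('/')
    # One row per (row, node), then aggregate per node, type and region
    out = (df.assign(Node=nodes).explode('Node')
             .groupby(['Node', 'Type', 'Region'], as_index=False)
             .agg(funcs))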
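To see what the `explode` in step 1 contributes, the following sketch runs it on a three-row sample (an assumption made up for illustration, as are the names `exploded` and `per_node`). Each row is repeated once per node on its path, so summing per node already gives the rolled-up cost.

```python
import pandas as pd

# Three-row sample: one row per depth, all with the same Type and Region.
df = pd.DataFrame({
    "Hierarchy": ["A", "A/B", "A/B/C"],
    "Cost": [10, 10, 10],
    "Type": ["I", "I", "I"],
    "Region": ["X", "X", "X"],
})

# Lists of nodes per row: ['A'], ['A', 'B'], ['A', 'B', 'C']
nodes = df["Hierarchy"].str.split("/")

# One output row per (original row, node on its path): 1 + 2 + 3 rows.
exploded = df.assign(Node=nodes).explode("Node")
print(exploded)

# Summing Cost per node shows the roll-up: A collects all three rows.
per_node = exploded.groupby("Node")["Cost"].sum()
print(per_node)
```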

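Step 2. Fix the hierarchy

    # One column per path level
    paths = out['Hierarchy'].str.split('/', expand=True)
    # Keep only the levels up to and including the row's Node; deeper levels become missing
    paths *= paths.eq(out['Node'], axis=0).replace(False, None).bfill(axis=1)
    # Rejoin the remaining levels of each row into a path
    paths = paths.melt(ignore_index=False).dropna().groupby(level=0)['value'].agg('/'.join)
    out['Hierarchy'] = paths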
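The effect of step 2 is to cut each path off at the row's `Node`. The sketch below shows that same truncation written as plain Python, which may be easier to follow. It is a swapped-in alternative, not the code from this answer: `truncate_at` is a hypothetical helper, the small `out` frame is made up to resemble the result of step 1, and the sketch assumes each node name occurs at most once within a path.

```python
import pandas as pd

# A small frame shaped like `out` after step 1 (made-up sample).
out = pd.DataFrame({
    "Hierarchy": ["A", "A/B/C", "A/B", "A/B/C", "A/B/C"],
    "Node": ["A", "A", "B", "B", "C"],
})


def truncate_at(path, node):
    """Hypothetical helper: keep the path up to and including `node`."""
    parts = path.split("/")
    # Assumes `node` occurs in `parts`; index() returns its first position.
    return "/".join(parts[: parts.index(node) + 1])


out["Hierarchy"] = [
    truncate_at(path, node) for path, node in zip(out["Hierarchy"], out["Node"])
]
print(out)
```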


Output:

Final result

    >>> out
       Hierarchy  Cost Type Region Node
    0          A    30    I      X    A
    1          A    30    I      Y    A
    2          A    30    J      X    A
    3          A    30    J      Y    A
    4          A    10    K      X    A
    5          A    10    K      Y    A
    6        A/B    20    I      X    B
    7        A/B    20    I      Y    B
    8        A/B    20    J      X    B
    9        A/B    20    J      Y    B
    10       A/B    10    K      X    B
    11       A/B    10    K      Y    B
    12     A/B/C    10    I      X    C
    13     A/B/C    10    I      Y    C
    14     A/B/C    10    J      X    C
    15     A/B/C    10    J      Y    C
    16     A/B/C    10    K      X    C
    17     A/B/C    10    K      Y    C

Before fix (before step 2)

    >>> out
       Hierarchy  Cost Type Region Node  # <- This column gives the depth of the path
    0          A    30    I      X    A
    1          A    30    I      Y    A
    2          A    30    J      X    A
    3          A    30    J      Y    A
    4      A/B/C    10    K      X    A  # Should be fixed, depth A
    5      A/B/C    10    K      Y    A  # Should be fixed, depth A
    6        A/B    20    I      X    B
    7        A/B    20    I      Y    B
    8        A/B    20    J      X    B
    9        A/B    20    J      Y    B
    10     A/B/C    10    K      X    B  # Should be fixed, depth B
    11     A/B/C    10    K      Y    B  # Should be fixed, depth B
    12     A/B/C    10    I      X    C
    13     A/B/C    10    I      Y    C
    14     A/B/C    10    J      X    C
    15     A/B/C    10    J      Y    C
    16     A/B/C    10    K      X    C
    17     A/B/C    10    K      Y    C


**Note**: the code above is probably a bit slow, but it should work. You can also use `networkx` to build the graph with the `Cost` column as the weight, then find all the simple paths from the roots to the leaves and collect the weights.

</details>



# Answer 2
**Score**: 0


You have to split the hierarchy column into letters, something like this:

    b = ''
    for i in 'ABC':
        b += i                # builds the column names 'A', 'AB', 'ABC'
        df[b] = 0             # default for rows whose path does not contain the letter
        df.loc[df['Hierarchy'].str.contains(i), b] = b

Then you can group by the created columns like this:

    df.groupby(by=['A', 'AB', 'ABC', 'Type', 'Region']).sum()

This code worked for me.

huangapple
  • Posted on June 19, 2023 at 14:43:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76504200.html