Posted: June 19, 2023, 14:43:16

How to group by all values in a DataFrame column that start with the same prefix?

Question

I have a DataFrame that I need to group by and sum.

I have used a regex function to find and group all the rows that start with the respective countries.

Suppose I have this dataset:

    Hierarchy  Cost  Type  Region
    A          10    I     X
    A          10    I     Y
    A          10    J     X
    A          10    J     Y
    A/B        10    I     X
    A/B        10    I     Y
    A/B        10    J     X
    A/B        10    J     Y
    A/B/C      10    I     X
    A/B/C      10    I     Y
    A/B/C      10    J     X
    A/B/C      10    J     Y
    A/B/C      10    K     X
    A/B/C      10    K     Y

Expected output:

    Hierarchy  Cost  Type  Region
    A          30    I     X
    A          30    I     Y
    A          30    J     X
    A          30    J     Y
    A          10    K     X
    A          10    K     Y
    A/B        20    I     X
    A/B        20    I     Y
    A/B        20    J     X
    A/B        20    J     Y
    A/B        10    K     X
    A/B        10    K     Y
    A/B/C      10    I     X
    A/B/C      10    I     Y
    A/B/C      10    J     X
    A/B/C      10    J     Y
    A/B/C      10    K     X
    A/B/C      10    K     Y

Rules for grouping:
The only tricky part is "Hierarchy": the cost in the output should be the sum of the row itself and all of its sub-hierarchies. For example:

    Hierarchy  Cost  Type  Region
    A          30    I     X

is the sum of:

    A          10    I     X
    A/B        10    I     X
    A/B/C      10    I     X

Another case is:

    Hierarchy  Cost  Type  Region
    A          10    K     Y

The input data has no record for "Hierarchy" A with "Type" K, but it has:

    A/B/C      10    K     Y

Since A/B/C is a sub-hierarchy of A,

    A          10    K     Y

is added to the output.

I tried:

    df.groupby(grp_df['Hierarchy'].str.split('/').str[0], sort=False).sum().reset_index()

But it only gave:

    Hierarchy  Cost
    A          140
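
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data and shows why the attempt collapses to one row. It assumes that `grp_df` in the snippet above is the same frame as `df`, and it selects only `Cost` before summing, which the original line does not do. Every path in the sample starts with `A`, so grouping on the first path segment puts all 14 rows into a single group.

```python
import pandas as pd

# Sample data from the question: every row has Cost 10.
rows = (
    [("A", t, r) for t in "IJ" for r in "XY"]
    + [("A/B", t, r) for t in "IJ" for r in "XY"]
    + [("A/B/C", t, r) for t in "IJK" for r in "XY"]
)
df = pd.DataFrame(rows, columns=["Hierarchy", "Type", "Region"])
df.insert(1, "Cost", 10)

# Grouping on the first path segment: every path starts with "A",
# so Type and Region are ignored and all rows land in one group.
root = df["Hierarchy"].str.split("/").str[0]
attempt = df.groupby(root, sort=False)["Cost"].sum().reset_index()
print(attempt)
```

Getting one row per hierarchy level requires each input row to count towards all of its ancestors, and `Type` and `Region` have to be part of the grouping key.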

Answer 1

Score: 2

You can build the expected DataFrame by exploding the hierarchy.

Following the same idea as your code, you can add the missing rows using `str.split` and `explode`. However, you then have to fix the `Hierarchy` column to get the expected DataFrame:

Step 1. Add missing rows

    # The dict union operator (|) requires Python 3.9+
    funcs = {col: 'first' for col in df.columns} | {'Cost': 'sum', 'Node': 'first'}
    # One row per path segment: 'A/B/C' becomes three rows with Node A, B and C
    nodes = df['Hierarchy'].str.split('/')
    out = (df.assign(Node=nodes).explode('Node')
             .groupby(['Node', 'Type', 'Region'], as_index=False)
             .agg(funcs))

Step 2. Fix the hierarchy

    # One column per path segment
    paths = out['Hierarchy'].str.split('/', expand=True)
    # Mark the segment equal to Node, then back-fill along the row so that
    # every segment up to and including Node is kept and the rest becomes null
    paths *= paths.eq(out['Node'], axis=0).replace(False, None).bfill(axis=1)
    # Re-join the kept segments of each row into a path
    paths = paths.melt(ignore_index=False).dropna().groupby(level=0)['value'].agg('/'.join)
    out['Hierarchy'] = paths

Output:

Final result

    >>> out
       Hierarchy  Cost Type Region Node
    0          A    30    I      X    A
    1          A    30    I      Y    A
    2          A    30    J      X    A
    3          A    30    J      Y    A
    4          A    10    K      X    A
    5          A    10    K      Y    A
    6        A/B    20    I      X    B
    7        A/B    20    I      Y    B
    8        A/B    20    J      X    B
    9        A/B    20    J      Y    B
    10       A/B    10    K      X    B
    11       A/B    10    K      Y    B
    12     A/B/C    10    I      X    C
    13     A/B/C    10    I      Y    C
    14     A/B/C    10    J      X    C
    15     A/B/C    10    J      Y    C
    16     A/B/C    10    K      X    C
    17     A/B/C    10    K      Y    C

Before fix (before step 2)

    >>> out
       Hierarchy  Cost Type Region Node  # <- this column gives the depth of the path
    0          A    30    I      X    A
    1          A    30    I      Y    A
    2          A    30    J      X    A
    3          A    30    J      Y    A
    4      A/B/C    10    K      X    A  # should be fixed, depth A
    5      A/B/C    10    K      Y    A  # should be fixed, depth A
    6        A/B    20    I      X    B
    7        A/B    20    I      Y    B
    8        A/B    20    J      X    B
    9        A/B    20    J      Y    B
    10     A/B/C    10    K      X    B  # should be fixed, depth B
    11     A/B/C    10    K      Y    B  # should be fixed, depth B
    12     A/B/C    10    I      X    C
    13     A/B/C    10    I      Y    C
    14     A/B/C    10    J      X    C
    15     A/B/C    10    J      Y    C
    16     A/B/C    10    K      X    C
    17     A/B/C    10    K      Y    C

**Note**: the code above is probably a bit slow, but it should work. You can also use `networkx` to build the graph with the `Cost` column as the weight, then find all the simple paths from roots to leaves and collect the weights.
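
A different technique from the one in this answer, sketched here as one possible alternative: expand each `Hierarchy` value into the list of its own path plus all ancestor paths, explode that list, and group on the full path. The roll-up and the path labels then come out of a single `groupby`, with no repair step. The helper name `prefixes` and the result name `rolled` are hypothetical, and the sample frame is rebuilt so the block runs on its own.

```python
import pandas as pd

# Sample data from the question: every row has Cost 10.
rows = (
    [("A", t, r) for t in "IJ" for r in "XY"]
    + [("A/B", t, r) for t in "IJ" for r in "XY"]
    + [("A/B/C", t, r) for t in "IJK" for r in "XY"]
)
df = pd.DataFrame(rows, columns=["Hierarchy", "Type", "Region"])
df.insert(1, "Cost", 10)


def prefixes(path, sep="/"):
    """Return the path and all of its ancestors: 'A/B/C' -> ['A', 'A/B', 'A/B/C']."""
    parts = path.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


rolled = (
    df.assign(Hierarchy=df["Hierarchy"].map(prefixes))  # list of paths per row
      .explode("Hierarchy")                             # one row per (row, ancestor)
      .groupby(["Hierarchy", "Type", "Region"], as_index=False)["Cost"]
      .sum()
)
print(rolled)
```

Because every row is counted once for itself and once for each ancestor, a level with no direct rows for some `Type` (such as `A` with `K`) still appears in the result through its descendants.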
Answer 2

Score: 0

You have to split the Hierarchy column into letters, something like this:

    b = ''
    for i in 'ABC':
        b += i                # column names become 'A', 'AB', 'ABC'
        df[b] = 0             # default for rows that do not contain the letter
        df.loc[df['Hierarchy'].str.contains(i), b] = b

Then you can group by the created columns like this:

    df.groupby(by=['A', 'AB', 'ABC', 'Type', 'Region']).sum()

This code worked for me.
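
As far as I can tell, grouping on all three indicator columns at once yields one group per original `Hierarchy` value, so parent levels do not absorb the cost of their children. A related sketch, which is a variation and not this answer's code, builds one "starts with" mask per level and aggregates each level separately. The list of levels is hard-coded here as an assumption, the names `is_member`, `pieces` and `by_level` are hypothetical, and the sample frame is rebuilt so the block runs on its own.

```python
import pandas as pd

# Sample data from the question: every row has Cost 10.
rows = (
    [("A", t, r) for t in "IJ" for r in "XY"]
    + [("A/B", t, r) for t in "IJ" for r in "XY"]
    + [("A/B/C", t, r) for t in "IJK" for r in "XY"]
)
df = pd.DataFrame(rows, columns=["Hierarchy", "Type", "Region"])
df.insert(1, "Cost", 10)

pieces = []
for level in ["A", "A/B", "A/B/C"]:  # assumed known in advance
    # A row belongs to a level if it is that level or sits below it.
    # Appending "/" avoids matching a sibling such as "AB" for level "A".
    is_member = (df["Hierarchy"] == level) | df["Hierarchy"].str.startswith(level + "/")
    part = df[is_member].groupby(["Type", "Region"], as_index=False)["Cost"].sum()
    part.insert(0, "Hierarchy", level)
    pieces.append(part)

by_level = pd.concat(pieces, ignore_index=True)
print(by_level)
```

This makes one pass over the frame per level, which is fine for a handful of levels but may be slow when there are many distinct paths.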
