2023-08-09 14:42:11

Aggregate rows in pandas

Question

I have many similar rows in a pandas DataFrame, like this:

Date	Position
2023-08-01 12:01:00	A23
2023-08-01 12:20:00	A23
2023-08-01 13:10:10	A23
2023-08-02 12:00:00	B12
2023-08-02 12:01:00	A23
2023-08-02 12:05:00	A23

I need to aggregate the rows by "Position" and merge each run of rows into a datetime range, like this:

Date	Date2	Position
2023-08-01 12:01:00	2023-08-01 13:10:10	A23
2023-08-02 12:00:00	NaN	B12
2023-08-02 12:01:00	2023-08-02 12:05:00	A23

Thank you

Answer 1

Score: 1

Assuming you want the min/max date for each group of identical successive positions, you can use a custom groupby.agg with post-processing:

import pandas as pd

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# label each run of successive identical positions
group = df['Position'].ne(df['Position'].shift()).cumsum()
out = (df
   .groupby(group, as_index=False)
   .agg(Date=('Date', 'min'),
        Date2=('Date', 'max'),
        Position=('Position', 'first'),
        n=('Position', 'count')
       )
   # hide Date2 when the group has only one item (pop also drops the helper column n)
   # you could also check that Date ≠ Date2
   .assign(Date2=lambda d: d['Date2'].where(d.pop('n').gt(1)))
)

NB. To group by position and day instead, use .groupby(['Position', df['Date'].dt.normalize()], as_index=False).

Output:

                 Date               Date2 Position
0 2023-08-01 12:01:00 2023-08-01 13:10:10      A23
1 2023-08-02 12:00:00                 NaT      B12
2 2023-08-02 12:01:00 2023-08-02 12:05:00      A23
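
To see why the cumulative-sum key keeps the two A23 runs apart, here is a minimal sketch on the positions from the question. The names `positions`, `changed`, and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

# Positions in the same order as the question's sample data
positions = pd.Series(["A23", "A23", "A23", "B12", "A23", "A23"], name="Position")

# True wherever the value differs from the previous row
# (the first row compares against NaN, so it is always True)
changed = positions.ne(positions.shift())

# Running count of changes: every run of identical successive values gets its own id
run_id = changed.cumsum()

print(pd.DataFrame({"Position": positions, "changed": changed, "run_id": run_id}))
```

Because the id increases at every change, the second A23 run gets a different id from the first, which a plain groupby on Position would not do.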

Answer 2

Score: 1

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    {'Date': {0: Timestamp('2023-08-01 12:01:00'),
              1: Timestamp('2023-08-01 12:20:00'),
              2: Timestamp('2023-08-01 13:10:10'),
              3: Timestamp('2023-08-02 12:00:00'),
              4: Timestamp('2023-08-02 12:01:00'),
              5: Timestamp('2023-08-02 12:05:00')},
    'Position': {0: 'A23',
                 1: 'A23',
                 2: 'A23',
                 3: 'B12',
                 4: 'A23',
                 5: 'A23'}}
)
# start a new group whenever Position differs from the previous row's Position
df['Group'] = (df['Position'] != df['Position'].shift()).cumsum()
# group by the Group and Position columns, take the min and max of Date, then drop the Group column
df = df.groupby(['Group', 'Position'])['Date'].agg(['min', 'max']).reset_index().drop('Group', axis=1)
# if max == min, then max should be NaT
df['max'] = df['max'].where(df['max'] != df['min'])
# rename the columns to the desired names
df = df.rename(columns={'min': 'Date1', 'max': 'Date2'})
# Output:
>>> df
  Position               Date1               Date2
0      A23 2023-08-01 12:01:00 2023-08-01 13:10:10
1      B12 2023-08-02 12:00:00                 NaT
2      A23 2023-08-02 12:01:00 2023-08-02 12:05:00
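
The two answers blank out Date2 in slightly different ways: one checks the group size, the other checks whether max equals min. On the question's data they agree, but they can differ when a run contains several rows with the same timestamp. The sketch below uses hypothetical data (a B12 run with a repeated timestamp) and illustrative names (`run_id`, `agg`, `by_count`, `by_equality`) to show the difference.

```python
import pandas as pd

# Hypothetical data: the B12 run has two rows sharing one timestamp
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-08-01 12:01:00",
        "2023-08-01 13:10:10",
        "2023-08-02 12:00:00",
        "2023-08-02 12:00:00",
    ]),
    "Position": ["A23", "A23", "B12", "B12"],
})

# Label runs of successive identical positions
run_id = df["Position"].ne(df["Position"].shift()).cumsum().rename("run")

agg = (
    df.groupby(run_id)
      .agg(Date=("Date", "min"),
           Date2=("Date", "max"),
           Position=("Position", "first"),
           n=("Position", "count"))
      .reset_index(drop=True)
)

# Size-based rule: keep Date2 whenever the run has more than one row
by_count = agg["Date2"].where(agg["n"].gt(1))

# Equality-based rule: keep Date2 only when it differs from Date
by_equality = agg["Date2"].where(agg["Date2"].ne(agg["Date"]))

print(pd.DataFrame({"Position": agg["Position"],
                    "by_count": by_count,
                    "by_equality": by_equality}))
```

Which rule is appropriate depends on whether a repeated timestamp should count as a range; the question does not say, so treat this as a choice to make for your own data.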
