How to add a new column in pandas Dataframe if the string or object value of column 1 is repeated in three consecutive rows


Question

Say I have a dataframe like this:

import pandas as pd
df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307', 'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340','P569','P987','P569']})

I need to add a column y: if the value in ID is the same for three consecutive rows, y should be Yes; otherwise it should be No.

Here is what I have tried:

# create a rolling window of size 3
rolling = df['ID'].rolling(3)

# apply a custom function to the rolling window to check if all values are the same
df['y'] = rolling.apply(lambda x: 'Yes' if all(x == x[0]) else 'No')

However, the above code throws the following error:

DataError: No numeric types to aggregate

The final desired output would be:

       ID    y
0   p1305  Yes
1   p1305  Yes
2   p1305  Yes
3   p1307  Yes
4   p1307  Yes
5   p1307  Yes
6   p1301  Yes
7   p1301  Yes
8   p1301  Yes
9   p1340  Yes
10  p1340  Yes
11  p1340  Yes

Any suggestions or help are much appreciated!
Thanks

Answer 1

Score: 1


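rolling only aggregates numeric data, so you need to trick the method by converting the IDs to numbers first, for example using factorize (or a Categorical):

df['y'] = (
    pd.Series(pd.factorize(df['ID'])[0], index=df.index)  # one integer code per distinct ID
      .rolling(3, min_periods=1)
      .apply(lambda s: s.iloc[1:].eq(s.iloc[0]).all())     # do all values in the window match the first?
      .astype(bool)
)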

Output:

       ID      y
0   p1305   True
1   p1305   True
2   p1305   True
3   p1307  False
4   p1307  False
5   p1307   True
6   p1301  False
7   p1301  False
8   p1301   True
9   p1340  False
10  p1340  False
11  p1340   True
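
The same check can also be written without rolling.apply, by comparing each row with the two rows before it. This is only a minimal sketch and not part of the original answer; the column name y_shift is made up for illustration. Unlike the min_periods=1 version above, it leaves the first two rows False, because they do not have two predecessors to compare against.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307',
                          'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340',
                          'P569', 'P987', 'P569']})

ids = df['ID']

# True only on a row whose value equals both of the two preceding rows,
# i.e. the row that completes a run of three identical IDs.
# y_shift is a hypothetical column name used for this sketch.
df['y_shift'] = ids.eq(ids.shift(1)) & ids.eq(ids.shift(2))

print(df)
```

Because this compares the strings directly, no conversion to numbers is needed.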

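The rolling check marks only the row that completes a run of three (rows 0 and 1 are True above only because min_periods=1 lets the shorter windows through). If you want True on every row of the run, label the consecutive runs and test their size instead:

# a new run starts wherever ID differs from the previous row
group = df['ID'].ne(df['ID'].shift()).cumsum()
# True for every row in a run of exactly 3; use .ge(3) for 3 or more
df['y'] = df.groupby(group)['ID'].transform('size').eq(3)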

Output:

       ID     y
0   p1305  True
1   p1305  True
2   p1305  True
3   p1307  True
4   p1307  True
5   p1307  True
6   p1301  True
7   p1301  True
8   p1301  True
9   p1340  True
10  p1340  True
11  p1340  True
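
The question asked for Yes/No labels rather than booleans. As a rough sketch (not from the original answer), the run-size mask can be mapped to strings with numpy.where; the names run_id and run_len are illustrative. The last three rows of the sample data (P569, P987, P569) are not consecutive repeats, so they come out as No.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307',
                          'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340',
                          'P569', 'P987', 'P569']})

# A new run starts wherever the ID differs from the previous row;
# the cumulative sum turns those starts into one label per run.
run_id = df['ID'].ne(df['ID'].shift()).cumsum()

# Length of the run that each row belongs to.
run_len = df.groupby(run_id)['ID'].transform('size')

# Label every row of a run of exactly three;
# switch to run_len.ge(3) to accept runs of three or more.
df['y'] = np.where(run_len.eq(3), 'Yes', 'No')

print(df)
```

Keeping the run label and the run length as separate variables makes it easy to inspect them when the result is not what you expect.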

huangapple
  • Posted on February 8, 2023, 19:39:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75385262.html