How to add a new column to a pandas DataFrame when a string (object) value in a column repeats in three consecutive rows
Question
Say I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307', 'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340','P569','P987','P569']})
I need to add a column y. If the value in ID is the same for three consecutive rows, y should be Yes for those rows; otherwise it should be No.
Here is what I have tried:
# create a rolling window of size 3
rolling = df['ID'].rolling(3)
# apply a custom function to the rolling window to check if all values are the same
df['y'] = rolling.apply(lambda x: 'Yes' if all(x == x[0]) else 'No')
However, the code above raises the following error:
DataError: No numeric types to aggregate
The final desired output would be:
ID y
0 p1305 Yes
1 p1305 Yes
2 p1305 Yes
3 p1307 Yes
4 p1307 Yes
5 p1307 Yes
6 p1301 Yes
7 p1301 Yes
8 p1301 Yes
9 p1340 Yes
10 p1340 Yes
11 p1340 Yes
Any suggestions or help are much appreciated!
Thanks
Answer 1
Score: 1
You need to trick the method by converting the values to numbers first, for example using factorize (or a Categorical):
df['y'] = (
    pd.Series(pd.factorize(df['ID'])[0], index=df.index)
    .rolling(3, min_periods=1)
    .apply(lambda s: s.iloc[1:].eq(s.iloc[0]).all())
    .astype(bool)
)
Output:
ID y
0 p1305 True
1 p1305 True
2 p1305 True
3 p1307 False
4 p1307 False
5 p1307 True
6 p1301 False
7 p1301 False
8 p1301 True
9 p1340 False
10 p1340 False
11 p1340 True
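For comparison, here is a minimal sketch of a different technique that skips the numeric conversion: it compares each value with the two rows before it using shift. This is not the rolling approach above; the names ids and ends_run are only illustrative. Unlike the rolling version with min_periods=1, it leaves the first two rows False, because they do not have two predecessors.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307',
                          'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340',
                          'P569', 'P987', 'P569']})

ids = df['ID']

# True only where the current value equals both of the two preceding values,
# i.e. on the row that completes a run of three identical IDs.
# Comparisons against the NaN introduced by shift() evaluate to False.
ends_run = ids.eq(ids.shift(1)) & ids.eq(ids.shift(2))

print(df.assign(ends_run=ends_run))
```

Because it works directly on the string column, this avoids the "No numeric types to aggregate" error, but it only flags the last row of each run rather than all three.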
Another approach, if you want True in every row of a run, is to group consecutive identical values and check the size of each group:
group = df['ID'].ne(df['ID'].shift()).cumsum()
df['y'] = df.groupby(group)['ID'].transform('size').eq(3) # or .ge(3)
Output:
ID y
0 p1305 True
1 p1305 True
2 p1305 True
3 p1307 True
4 p1307 True
5 p1307 True
6 p1301 True
7 p1301 True
8 p1301 True
9 p1340 True
10 p1340 True
11 p1340 True
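Since the question asks for Yes/No strings rather than booleans, the grouping approach can be sketched end to end as follows. Using numpy.where for the mapping is my assumption about one convenient way to do it, and the names run_id and run_len are illustrative; the grouping logic is the same as above, with .ge(3) so that longer runs also count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['p1305', 'p1305', 'p1305', 'p1307', 'p1307', 'p1307',
                          'p1301', 'p1301', 'p1301', 'p1340', 'p1340', 'p1340',
                          'P569', 'P987', 'P569']})

# Give each run of consecutive identical IDs its own integer label:
# the label increases by one every time the value differs from the previous row.
run_id = df['ID'].ne(df['ID'].shift()).cumsum()

# Length of the run that each row belongs to.
run_len = df.groupby(run_id)['ID'].transform('size')

# Map the boolean condition to the strings requested in the question.
df['y'] = np.where(run_len.ge(3), 'Yes', 'No')

print(df)
```

With the sample data, the first twelve rows get Yes and the last three rows (P569, P987, P569) get No, since each of them forms a run of length one.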