英文:
Flag repeating entries in pandas time series
问题
我理解你的问题。你想要标记那些参与者在某一年去了他们前一年没有去过的度假地点的行。以下是你描述的问题的解决方案:
# 将'year'列转换为日期时间类型
df['year'] = pd.to_datetime(df['year'], format='%Y')
# 根据'id'对数据框进行分组
grouped = df.groupby('id')
# 添加一个'new'列,表示是否是新的度假地点
df['new'] = grouped['vacation'].transform(lambda x: x.ne(x.shift()))
# 输出结果
print(df)
这将给你想要的输出,标记了参与者在某一年去了他们前一年没有去过的度假地点的行。希望这对你有所帮助!
英文:
I have a data frame that takes this form (but is several millions of rows long):
import pandas as pd
dict = {'id':["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
'vacation':["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
'new':[1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(dict)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2000 Germany
B 2001 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows when a participant goes to holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2001 Germany 1
B 2002 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 1999 Australia 1
C 2000 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.
I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.
答案1
得分: 4
IIUC,
我们正在按id
和vacation
分组,并确保年份与上一年不相等,或者我们可以选择该组合的第一个实例。
希望这清楚了。如果需要更多帮助,请告诉我。
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new_2
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
英文:
IIUC,
what we are doing is grouping by id
& vacation
and ensuring that year is not equal to the year above, or we can selecting the first instance of that combination.
hopefully that's clear. let me know if you need anymore help.
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new_2
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
答案2
得分: 2
这是我用groupby和transform提出的一个解决方案:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
df.groupby(["id", "vacation"])
.transform(lambda x: x.iloc[0])
.year.eq(df.year)
.astype(int)
)
你将会得到以下结果:
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
如果你需要进一步的帮助,请告诉我。
英文:
Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
df.groupby(["id", "vacation"])
.transform(lambda x: x.iloc[0])
.year.eq(df.year)
.astype(int)
)
You'll get
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
答案3
得分: 2
这里是使用 groupby+cumcount
和 series.mask
的一种方法:
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
英文:
Here is a way using groupby+cumcount
and series.mask
:
df['new']=df.groupby(['id','vacation']).cumcount().add(1).mask(lambda x: x.gt(1),0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论