Flag repeating entries in pandas time series

Question

I have a data frame that takes this form (but is several million rows long):

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in
data = {
    'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
    'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
    'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
}
df = pd.DataFrame(data)

A  2000  France
A  2001  Morocco
A  2002  Morocco
B  2000  Germany
B  2001  Germany
B  2003  Germany
C  1999  Japan
C  2000  Australia
C  2001  Japan
D  2000  Canada
D  2000  Mexico
D  2001  China

For each person in each year, the holiday destination(s) are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows where a participant goes on holiday to a destination they had not visited the year before (i.e., the destination is new). In the case above, the output would be:

id  year  vacation   new
A   2000  France     1
A   2001  Morocco    1
A   2002  Morocco    0
B   2001  Germany    1
B   2002  Germany    0
B   2003  Germany    0
C   1999  Japan      1
C   2000  Australia  1
C   2001  Japan      1
D   2000  Canada     1
D   2000  Mexico     1
D   2001  China      1

For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.

I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.

I'd appreciate any input; thanks.

Answer 1

Score: 4

IIUC,

what we are doing is grouping by id and vacation and checking that the year is not equal to the year in the row above within that group, which amounts to selecting the first instance of that combination.

Hopefully that's clear. Let me know if you need any more help.

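df["new_2"] = (
    # group_keys=False keeps the original row index so the result aligns with df
    df.groupby(["id", "vacation"], group_keys=False)[["id", "year"]]
    .apply(lambda x: x.ne(x.shift()))  # compare each row with the previous row in its group
    .all(axis=1)  # True only when both id and year differ, i.e. the group's first row
    .add(0)  # turn the booleans into 0/1
)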

print(df)
  id  year   vacation  new_2
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
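
Note that grouping by id and vacation flags only the first time a destination ever appears for a person, so a destination revisited after a gap (like France for A in the output above) is not flagged again. If the question's "not visited the year before" rule is to be applied literally, one possible approach is a self-merge against the same visits shifted forward by one year. The following is only a sketch, not the answerer's method: the sample rows follow the question (with B's years taken from its expected-output table), integer years are an assumption, and `prev` and `merged` are names introduced here for illustration.

```python
import pandas as pd

# Sample data modelled on the question; B's years follow the expected-output table.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 1999, 2000, 2001, 2000, 2000, 2001],
    "vacation": ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
})

# Shift every visit forward by one year: a row of df that matches one of
# these shifted rows is a destination the person also visited the year before.
prev = df[["id", "year", "vacation"]].drop_duplicates()
prev["year"] += 1

# A left merge keeps the row order of df; because prev has no duplicate keys,
# the number of rows does not change. `_merge` records whether a match exists.
merged = df.merge(prev, on=["id", "year", "vacation"], how="left", indicator=True)

# Rows without a match in the previous year are new destinations.
df["new"] = (merged["_merge"] == "left_only").astype(int).to_numpy()
print(df)
```

Because the work is done by a single merge rather than a Python-level loop over rows, this should scale better than iterating, though memory use roughly doubles while `prev` exists.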

Answer 2

Score: 2

Here's one solution I came up with, using groupby and transform:

df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])
    .transform(lambda x: x.iloc[0])
    .year.eq(df.year)
    .astype(int)
)

You'll get

  id  year   vacation  new
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
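
After sorting, the first row of each id/vacation group holds that group's earliest year, so the same flag can probably be computed without the sort or the lambda by comparing each row's year with the group minimum. This is a sketch of that variation rather than the answer's own code; the sample rows are copied from the output above, storing years as integers is an assumption, and `first_year` is a name introduced here.

```python
import pandas as pd

# Sample rows copied from the printed output above; years as integers (assumption).
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 1999, 2000, 2001],
    "vacation": ["France", "USA", "France", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "France"],
})

# Earliest year per (id, vacation), broadcast back to every row of the group.
first_year = df.groupby(["id", "vacation"])["year"].transform("min")

# A row is flagged when its year is the earliest year for that id/vacation pair.
df["new"] = df["year"].eq(first_year).astype(int)
print(df)
```

Passing the string "min" lets pandas use its built-in aggregation instead of calling a Python function once per group, which tends to matter on frames with millions of rows.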

Answer 3

Score: 2

Here is a way using groupby+cumcount and series.mask:

df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)

  id  year   vacation  new
0  A  2000     France    1
1  A  2001        USA    1
2  A  2002     France    0
3  B  2001    Germany    1
4  B  2002    Germany    0
5  B  2003    Germany    0
6  C  1999      Japan    1
7  C  2000  Australia    1
8  C  2001     France    1
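
Since cumcount is 0 only for the first row of each id/vacation group, the same flag can likely be obtained with `DataFrame.duplicated`, which marks every repeat of an id/vacation pair after its first appearance. The snippet below is a sketch of that equivalent form, not part of the original answer; the sample rows are copied from the output above. Like the cumcount version, it depends on the rows already being in chronological order within each id.

```python
import pandas as pd

# Sample rows copied from the printed output above.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": ["2000", "2001", "2002", "2001", "2002", "2003", "1999", "2000", "2001"],
    "vacation": ["France", "USA", "France", "Germany", "Germany", "Germany",
                 "Japan", "Australia", "France"],
})

# duplicated() is False for the first occurrence of each (id, vacation) pair
# and True for later repeats, so negating it flags first occurrences.
df["new"] = (~df.duplicated(subset=["id", "vacation"])).astype(int)
print(df)
```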

huangapple
  • Posted on January 7, 2020 at 00:04:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59615272.html